LWN: Comments on "Support for atomic block writes in 6.13" https://lwn.net/Articles/1009298/ This is a special feed containing comments posted to the individual LWN article titled "Support for atomic block writes in 6.13". How does this work at a physical level? https://lwn.net/Articles/1014516/ https://lwn.net/Articles/1014516/ sammythesnake <div class="FormattedComment"> With transfer speeds of GiB/s (or even MiB/s if we're talking about the Good Old Days) a 4KiB sector will take bugger all time to write, so a supercap is colossal overkill for "finish the sector" power. Flushing the whole of a sizeable cache, though, would likely be beyond the realm of a non-super capacitor, especially in the pathological case of the cached writes all being single sectors nowhere near each other...<br> </div> Tue, 18 Mar 2025 18:20:13 +0000 How does this work at a physical level? https://lwn.net/Articles/1013162/ https://lwn.net/Articles/1013162/ ras <div class="FormattedComment"> Spinning rust has always had to solve this problem to some extent. You can't have a half-written block causing disk errors, so they ensured they had enough power around to complete writing a sector once it started. I'm not sure how they did it. One explanation I heard was that they used the rotational energy stored in the platters. Extending that time with a super capacitor wouldn't be a big change.<br> </div> Thu, 06 Mar 2025 08:32:14 +0000 How does this work at a physical level? https://lwn.net/Articles/1012796/ https://lwn.net/Articles/1012796/ mebrown <div class="FormattedComment"> No, it's not a double write.<br> <p> The new data is written to a new block, then the mapping table entry is atomically switched so that the old data is unmapped/freed and the new data is swapped in.<br> </div> Mon, 03 Mar 2025 17:09:30 +0000 How does this work at a physical level? https://lwn.net/Articles/1011510/ https://lwn.net/Articles/1011510/ Paf <div class="FormattedComment"> "Use the FTL's block mapping to allow you to atomically switch in a new mapping with a single bit write, and do not return that the command is complete until the new mapping is switched in. Then, you can write the new data and associated mapping, followed by the single bit write to switch the mapping over. If that bit write succeeds, the swap over is done; if it doesn't, the swap fails to happen."<br> <p> I like that this is basically double writes - sometimes, it's turtles all the way down.<br> </div> Mon, 24 Feb 2025 01:32:06 +0000 How does this work at a physical level? https://lwn.net/Articles/1011480/ https://lwn.net/Articles/1011480/ kleptog <div class="FormattedComment"> For non-SSDs you have hardware RAID controllers with battery-backed memory that holds the blocks waiting to be written; if the power fails, it keeps the data and writes it out when the power comes back.<br> </div> Sun, 23 Feb 2025 13:50:52 +0000 How does this work at a physical level? https://lwn.net/Articles/1011476/ https://lwn.net/Articles/1011476/ farnz It doesn't need a capacitor, necessarily. There are two routes you can take in an SSD to enable this feature: <ol> <li>Have a capacitor or other energy store on the device, so that when power is lost, you can complete all the writes before the energy store drains.
<li>Use the FTL's block mapping to allow you to atomically switch in a new mapping with a single bit write, and do not return that the command is complete until the new mapping is switched in. Then, you can write the new data and associated mapping, followed by the single bit write to switch the mapping over. If that bit write succeeds, the swap over is done; if it doesn't, the swap fails to happen. </ol> <p>A capacitor is more likely, because it lets you have a write cache, too, and thus a performance advantage in enterprise drives. But it's possible without one in an SSD. Sun, 23 Feb 2025 13:01:32 +0000 How does this work at a physical level? https://lwn.net/Articles/1011464/ https://lwn.net/Articles/1011464/ KJ7RRV <div class="FormattedComment"> That makes sense; thank you! So the drive has to be built with a capacitor to enable this feature?<br> </div> Sun, 23 Feb 2025 07:25:48 +0000 How does this work at a physical level? https://lwn.net/Articles/1011459/ https://lwn.net/Articles/1011459/ Cyberax <div class="FormattedComment"> A capacitor can store enough charge to power the chip long enough to finish writing.<br> </div> Sun, 23 Feb 2025 04:35:32 +0000 How does this work at a physical level? https://lwn.net/Articles/1011454/ https://lwn.net/Articles/1011454/ KJ7RRV <div class="FormattedComment"> How is an atomic write possible on physical media? Since writing data has to happen at a finite rate, thus taking nonzero time, isn't there always the possibility of abruptly losing power halfway through the write?<br> </div> Sun, 23 Feb 2025 02:37:27 +0000 The real experience https://lwn.net/Articles/1011167/ https://lwn.net/Articles/1011167/ butlerm <div class="FormattedComment"> It should recover just fine as long as the sector or page size for the physical hardware is as large as, or in some cases larger than, the database block size, and the drive in question can either write a full sector while spinning down after a power loss, or you have a battery-backed cache on your drive controller, or your SSDs have capacitors big enough that they can commit anything that is supposed to be forced to disk before they lose power, and of course your drive firmware does not have pathological bugs of the sort that used to be common in solid state drives for a while, or you have a reliable UPS with appropriate "shut everything down if the power failure lasts more than X minutes" software, or your datacenter has one that actually works.<br> <p> Most physical spinning rust style hard drives that I am aware of these days have 4 KB sector sizes, and most SSDs have 128 KB page sizes. In the case of the former you would ideally have 4 KB database data block sizes as well, but most databases have been using a default block size of 8192 bytes or more for some time now, so they presumably account for that possibility. A typical way that is done is to store an adequate checksum and block id of some kind inline with the database data block, so that a torn write or other unusual write failure or memory corruption can be detected when the block is next processed or read back in. Some filesystems, like zfs, of course do something quite similar, with block checksums or secure hashes stored with all or almost all internal block pointers.
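As a minimal illustration of that inline checksum-plus-block-id technique, here is a hedged sketch in C; the 8 KiB block size, the trailer layout, and the helper names are invented for the example rather than taken from any real database's on-disk format.
<pre>
/*
 * Toy version of the "checksum plus block id stored inline" idea described
 * above.  Layout and names are illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DB_BLOCK_SIZE 8192          /* hypothetical database block size */

struct block_trailer {
    uint64_t block_id;              /* which block this image claims to be */
    uint32_t crc;                   /* CRC-32C of everything before the trailer */
};

#define PAYLOAD_SIZE (DB_BLOCK_SIZE - sizeof(struct block_trailer))

/* Bitwise CRC-32C (Castagnoli); real systems use the CPU instruction or a
 * table-driven version, this loop is only here to keep the sketch short. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
    }
    return ~crc;
}

/* Seal a block image just before it is written out. */
void block_seal(uint8_t *block, uint64_t block_id)
{
    struct block_trailer t = {
        .block_id = block_id,
        .crc      = crc32c(block, PAYLOAD_SIZE),
    };
    memcpy(block + PAYLOAD_SIZE, &t, sizeof(t));
}

/* Verify a block image when it is read back: a torn or misdirected write
 * shows up as a checksum or block-id mismatch. */
bool block_verify(const uint8_t *block, uint64_t expected_id)
{
    struct block_trailer t;

    memcpy(&t, block + PAYLOAD_SIZE, sizeof(t));
    return t.block_id == expected_id && t.crc == crc32c(block, PAYLOAD_SIZE);
}
</pre>
A block that fails verification on read-back would then be treated as torn and rebuilt from the log (or a double-write area) rather than used as-is.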
SHA-256 and CRC-32c are typically supported with CPU instructions on most modern enterprise-class hardware, so that isn't too difficult or too slow.<br> </div> Fri, 21 Feb 2025 06:12:42 +0000 The real experience https://lwn.net/Articles/1011168/ https://lwn.net/Articles/1011168/ Thalience <p>Consumer-grade NVMe devices (that are already employing a flash-translation layer) have no reason <i>not</i> to do an atomic update when the spec says they should.</p> <p>But it no longer surprises me when exposing an API contract to wider testing also exposes some vendors/devices as doing the wrong thing for no good reason.</p> Fri, 21 Feb 2025 04:24:12 +0000 The real experience https://lwn.net/Articles/1011170/ https://lwn.net/Articles/1011170/ willy <div class="FormattedComment"> Well. This feature is being developed for systems which are sold to you as an appliance. Your vendor has tested that the feature works with the certified drives they sell to you with it.<br> <p> If you're using consumer-grade hardware, then you have to either take the manufacturer's word for it or test it yourself.<br> <p> Most consumer-grade hardware does not advertise that it supports multi-block atomicity though, so the question doesn't arise.<br> </div> Fri, 21 Feb 2025 04:21:13 +0000 The real experience https://lwn.net/Articles/1011166/ https://lwn.net/Articles/1011166/ ikm <div class="FormattedComment"> Do those things work in practice? I mean, safe recovery from a physical power failure, ideally on consumer-grade hardware. Every time this happens to me, I can't help but pray that the database *will* actually recover, even if in theory it must...<br> </div> Fri, 21 Feb 2025 02:10:33 +0000 Why write twice https://lwn.net/Articles/1011149/ https://lwn.net/Articles/1011149/ butlerm <div class="FormattedComment"> If you have atomic block writes, you do not need to write two different versions of the block in a "double write". That said, most databases use a redo log, and block updates are committed to the redo log long before data blocks are updated in a checkpoint of some sort.<br> <p> Filesystems like the NetApp filesystem, zfs, and btrfs use a phase tree approach that is not yet common for (relational) databases. One of the reasons it is not common is that most relational databases were and are designed to give acceptable performance on spinning rust type hard disks, and internal pointers in the database - in database indexes in particular - in most designs need to translate to the physical address of the referenced block without doing any additional physical disk I/O.<br> <p> That means that they run more efficiently on direct-mapped, non-phase-tree filesystems like xfs, ext4, or ntfs, if not actual raw disk devices, which used to be quite common in some environments. If you put a typical relational database on a filesystem like btrfs or zfs it will slow down dramatically for that reason. It can be done of course, especially with something like zfs, but most people don't do it. That goes for Oracle, MySQL, PostgreSQL, DB2, MS SQL Server, Sybase, and a number of other older relational databases that are not so popular anymore.<br> <p> If you want to design a relational or similar database to use a phase tree approach internally, the place you probably ought to start is with typical B-tree or hash indexes, which are already multiversioned in most designs, sometimes with versions that last for a considerable amount of time so that index rebuilds can be done without taking a table offline.
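For readers following the "write the new data out of place, then flip one small pointer" pattern discussed in this thread (the git/btrfs/lmdb style, and what the FTL does internally), here is a rough, hedged sketch in C of the ordering that makes it crash-safe; the two-copy layout, the commit record, and the assumption that rewriting a record much smaller than one sector is effectively atomic are all illustrative, not how any particular database or filesystem actually lays things out.
<pre>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical commit record, kept well under one sector. */
struct commit_record {
    uint64_t sequence;      /* monotonically increasing commit number */
    uint64_t current_copy;  /* 0 or 1: which on-disk copy of the block is live */
};

/*
 * Write the new block image to the *inactive* copy, make it durable, then
 * flip the commit record and make that durable too.  If power is lost before
 * the record reaches the media, the old copy is still intact and still the
 * one being pointed at.  (Short-write handling is omitted for brevity.)
 */
int commit_block(int data_fd, int ctl_fd, struct commit_record *rec,
                 const void *new_block, size_t block_size)
{
    uint64_t next = rec->current_copy ^ 1;

    /* 1. The new version lands next to, not on top of, the old one. */
    if (pwrite(data_fd, new_block, block_size, next * block_size) < 0)
        return -1;
    if (fsync(data_fd) < 0)     /* the data must be durable before the flip */
        return -1;

    /* 2. One small write flips which copy is current. */
    struct commit_record newrec = {
        .sequence     = rec->sequence + 1,
        .current_copy = next,
    };
    if (pwrite(ctl_fd, &newrec, sizeof(newrec), 0) < 0)
        return -1;
    if (fsync(ctl_fd) < 0)
        return -1;

    *rec = newrec;
    return 0;
}
</pre>
(The double write that jengelh asks about below exists because most databases update blocks in place instead, so a torn page cannot simply be ignored in favour of an older copy.)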
<br> <p> And although it is usually slower, it is possible to store primary key references instead of physical or logical database block references in secondary indexes, and use an index-organized table that basically puts the row data in the leaves of a B-tree or similar tree that is ordinarily only accessed by primary key value instead of by something like (datafile, block, row). Then of course it doesn't really matter if data blocks and rows have several versions at new file / block offsets, because the database would not generally access them by file / block / row number anyway, except at a very low level.<br> <p> PostgreSQL might be more amenable to this because, if I recall correctly, Postgres table data is index organized already, and old row versions are stored inline and have to be vacuumed or essentially garbage collected later. Oracle stores old row versions, to allow multiversion read consistency, in separate areas referred to as "rollback segments", which is one of the reasons why, although it supports very long-running transactions, it originally had a hard time keeping up with simpler non-MVCC designs like DB2, Ingres, Informix, Sybase, and MS SQL, and (of course) MySQL, especially before MySQL even had automated support for transactions and ACID properties in the first place. There was a major tradeoff there for years that was usually solved by throwing lots of disk spindles at the problem, like separate disks or RAID 1 mirrored pairs for different tablespaces, data files, index segments, rollback segments, control files, and redo logs.<br> </div> Thu, 20 Feb 2025 22:21:23 +0000 Why write twice https://lwn.net/Articles/1011135/ https://lwn.net/Articles/1011135/ jengelh <div class="FormattedComment"> <span class="QuotedText">&gt;To work around this possibility, databases employ an additional technique called "double write".</span><br> <p> Why do databases (still) need a double write? E.g. git, btrfs, and liblmdb seem to do just fine with writing the data block and then (atomically) updating a pointer.<br> </div> Thu, 20 Feb 2025 19:55:13 +0000 Atomic writes should be the default... ideally https://lwn.net/Articles/1011116/ https://lwn.net/Articles/1011116/ iabervon <div class="FormattedComment"> Most applications don't really want block-level atomicity properties; instead, they want whole-file atomicity, where the standard pattern is to write the new data to a new inode, and then make the filename refer to the new inode after the whole thing has been written. For a cat picture, if the file ends up containing partially old data and partially new data, it doesn't solve anything whether the transitions are on block boundaries or not.<br> <p> The thing that's special about databases isn't that their data is more precious, it's that small, well-defined parts of the file are being changed frequently and independently, so it's necessary to modify the stored data in place, and it's feasible and worthwhile to use a file structure where individual block changes correspond to valid file states.<br> </div> Thu, 20 Feb 2025 17:36:47 +0000 large atomic writes for xfs https://lwn.net/Articles/1011111/ https://lwn.net/Articles/1011111/ garrier79 <div class="FormattedComment"> JFYI, the latest support for large atomics on xfs is posted here...
<a href="https://lore.kernel.org/linux-xfs/0f983090-4399-4cba-910d-299bf5e0da2c@oracle.com/T/#m77f00f31830b5ad55cad204e8671355e30f1f518">https://lore.kernel.org/linux-xfs/0f983090-4399-4cba-910d...</a><br> </div> Thu, 20 Feb 2025 16:27:42 +0000 Atomic writes should be the default... ideally https://lwn.net/Articles/1011107/ https://lwn.net/Articles/1011107/ meven-collabora <div class="FormattedComment"> Any application if it has the opportunity would want to use atomic writes provided it isn't too much overhead neither too much hassle.<br> Like the direct I/O requirement and limited file size support of the current state.<br> Detecting hardware support is nicely exposed through statx already.<br> Hopefully this will benefit also regular user and mobile and not just database servers.<br> <p> Cat pictures, and account spreadsheets are precious too.<br> <p> Great work regardless.<br> </div> Thu, 20 Feb 2025 16:15:58 +0000