LWN: Comments on "Atomic writes without tears"
https://lwn.net/Articles/974578/
This is a special feed containing comments posted to the individual LWN article titled "Atomic writes without tears".

Atomic writes without tears
https://lwn.net/Articles/1005755/
Posted Tue, 21 Jan 2025 21:33:27 +0000 by zblaxell

> So using hardware still seems like a more attractive proposition. Unless I am missing something.

If you only want the filesystem to get out of the way of writing directly to your block device, yes. But you are missing some things.

The average cost isn't necessarily the worst-case cost. The checksum blocks can be amortized across multiple data blocks if the application follows the pattern "do multiple buffered writes, then call fdatasync() to make that data persistent and to detect whether there were any problems." Efficiency depends on the amount of data batched up before fdatasync(). There's no rule that says the metadata write has to go to the same device as the data write in a COW filesystem, which may affect the cost analysis (i.e. even in the one-data-block-plus-one-metadata-block case, one of the writes may be much cheaper than the other).

API-level compatibility is sometimes more important than performance. Do we want to force application developers to implement two backends, one for COW and one for non-COW filesystems, by failing in cases where there is no direct hardware support? Or do we prefer a model like "use a thin filesystem if the hardware supports atomic writes, or a COW filesystem if not", and keep only the backend that uses atomic writes? If there's already an ioctl associated with this, it seems we could have one that says what we can expect from the backend (no support, emulated software support, full hardware support, or "one or more of the lower block layers does not implement this ioctl"). I don't expect PostgreSQL to drop non-atomic-write backends any time soon, but new projects might not bother supporting the old ways if some other part of the system will do the work for them.

Administrator controls are sometimes more important than performance: e.g. if we have bought all the good hardware we can and still need more, we can end up with some terrible storage devices. The sysadmin might force all applications to run on a COW filesystem with checksums so that device failures (particularly unreported data corruption) can be detected. That covers "data integrity is more important than performance" too.

Filesystems that have checksums can verify writes long after the fact, and COW filesystems can do write ordering cheaply. In some use cases that makes the fdatasync() unnecessary (most of the same cases where async commit would be acceptable in the database). We'd still want to make sure that some obscure part of the kernel doesn't try to merge writes in a way that ends up tearing or reordering them before they get to the filesystem.
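To make the capability-probing side of that idea concrete, here is a minimal sketch using the statx(2) fields that were merged for this work (assuming a Linux 6.11+ kernel and matching headers). Note that, unlike the ioctl imagined above, statx() reports the supported write-unit sizes but does not distinguish emulated from hardware support:

```c
/* Sketch: probe a file for untorn-write support via statx(2).
 * Assumes Linux 6.11+ headers with the atomic-write statx fields. */
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/stat.h>   /* struct statx, STATX_WRITE_ATOMIC */

int main(int argc, char **argv)
{
    struct statx stx;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 2;
    }
    /* Raw syscall, in case the libc wrapper predates the new mask bit. */
    if (syscall(SYS_statx, AT_FDCWD, argv[1], 0,
                STATX_WRITE_ATOMIC, &stx) != 0) {
        perror("statx");
        return 1;
    }
    if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)) {
        printf("no untorn-write support on this file\n");
        return 0;
    }
    printf("untorn writes: %u..%u bytes per call, max %u segments\n",
           stx.stx_atomic_write_unit_min,
           stx.stx_atomic_write_unit_max,
           stx.stx_atomic_write_segments_max);
    return 0;
}
```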
Atomic writes without tears
https://lwn.net/Articles/976815/
Posted Mon, 03 Jun 2024 19:23:53 +0000 by willy

I think Jake has written down exactly what I said, but that ends up being not what I meant without the surrounding context ;-) We were talking about drives that specify an AWU / AWUPF of 16KiB (whether that's a 4KiB block size with an AWUPF value of 3, or 512-byte blocks with an AWUPF of 31).

Basically I was asking whether, since all implementations of this are cloud storage rather than drive firmware, we couldn't just have the cloud storage implement NVMe semantics even with the SCSI command set. Martin doesn't see the need, and since I haven't worked on NVMe in about 12 years, I'm deferring to his expertise.

Atomic writes without tears
https://lwn.net/Articles/976798/
Posted Mon, 03 Jun 2024 18:19:43 +0000 by andresfreund

> MySQL and PostgreSQL both use larger chunks, up to 16KB.

FWIW, with the default compilation options postgres supports just 8KB. With non-default options, 1, 2, 4, 8, 16, and 32KB are supported for data pages (1-64KB for WAL, but we don't need torn-page protection there).

> The reason for caring about buffered I/O is because PostgreSQL uses it; depending on who you talk to, it will be three to six years before the database can switch to using direct I/O.

FWIW, if it were defined to be OK to use unbuffered I/O for writes and buffered I/O for reads, the timelines could be shortened considerably. There would be no concurrent reads and writes to the same pages, due to postgres-level interlocking.

Realistically, there are workloads where we'll likely never be able to switch to unbuffered reads. Decent performance with unbuffered reads requires a well-configured buffer-pool size, which
a) a lot of users won't configure, and
b) doesn't work well when putting a lot of smaller postgres instances onto one system, which is common.

> An attendee said that with buffered I/O, there is no way for the application to get an error if the write fails. Ts'o said that any error would come when fdatasync() is called, which the attendee called "buffered in name only".

I don't really understand the "buffered in name only" comment.

> Kara asked about the impact of these changes on the database code.

From what I can tell, it would not be particularly hard for postgres. There is some nastiness around upgrading a database that wasn't created using FORCEALIGN and thus can't guarantee that all existing data files are aligned. To deal with that, it would be nice if we could specify on a per-write basis that we want an atomic write; if we get an error because a file is not suitably aligned, we can fall back to the current method of protecting against torn writes.
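As a sketch of what that per-write request and fallback could look like, assuming the RWF_ATOMIC pwritev2() flag from the patch series under discussion (illustrative only, not actual postgres code):

```c
/* Try an untorn write; if this file can't honor it (e.g. it isn't
 * suitably aligned), fall back to a plain write so the caller can
 * keep using the existing torn-write protection. */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* from the proposed uapi */
#endif

ssize_t write_page(int fd, const void *page, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = (void *)page, .iov_len = len };
    ssize_t n = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

    if (n >= 0 || (errno != EINVAL && errno != EOPNOTSUPP))
        return n;               /* untorn write, or a genuine error */
    /* No untorn support here: plain write; caller must torn-protect. */
    return pwritev2(fd, &iov, 1, off, 0);
}
```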
> Ts'o said that he believes the PostgreSQL developers are looking forward to a solution, so they are willing to make the required changes when the time comes.

Indeed. Our torn-write protection often ends up being the majority of the WAL volume.

There's a small caveat in that we currently use full-page writes to increase replay performance (we don't need to read the underlying data if we have the full new contents, so the amount of ~random reads can be drastically reduced).

> They are likely to backport those changes to earlier versions of PostgreSQL as well.

I doubt we'd backport this to earlier PG versions; we tend to be very conservative with backporting features. With a very limited set of exceptions (e.g. to support easier/more efficient upgrades), we only backport bug fixes.

> Wilcox said that probably did not matter, because the older versions of PostgreSQL are running on neolithic kernels. Ts'o said that is not necessarily the case, since some customers are willing to upgrade their kernel but require sticking with the older database system.

Yeah, that unfortunately is indeed common. It's not helped by the PG versions included in LTS distros being quite old. Lots of users get started on those and continue to use that PG version even when upgrading the OS.

Atomic writes without tears
https://lwn.net/Articles/976796/
Posted Mon, 03 Jun 2024 18:02:34 +0000 by andresfreund

> Matthew Wilcox noted that NVMe is specified to have 16KB tear boundaries; he wondered if the SCSI vendors could be convinced into doing something similar

I don't think NVMe in general guarantees anything up to that size; I suspect there might have been some miscommunication somewhere.

See sections 2.1.4.1 - 2.1.4.2.1 in
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-1.0d-2023.12.28-Ratified.pdf

It is, IMO, pretty clear that a device formatted with 512-byte blocks may guarantee only that 512 bytes are written in a tear-free manner. To guarantee more, AWUPF would need to be >= 1, which it very commonly isn't.
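The arithmetic behind those AWUPF figures, for concreteness: the field is zero-based, so the untorn-after-power-failure size is (AWUPF + 1) times the formatted LBA size. A toy calculation, not tied to any particular device:

```c
/* AWUPF is zero-based: both examples from the earlier comment give
 * 16KiB, while AWUPF == 0 guarantees only a single LBA. */
#include <stdio.h>

static unsigned long untorn_bytes(unsigned awupf, unsigned lba_size)
{
    return (unsigned long)(awupf + 1) * lba_size;
}

int main(void)
{
    printf("%lu\n", untorn_bytes(3, 4096)); /* 16384 */
    printf("%lu\n", untorn_bytes(31, 512)); /* 16384 */
    printf("%lu\n", untorn_bytes(0, 512));  /* 512: the common case */
    return 0;
}
```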
Atomic writes without tears
https://lwn.net/Articles/976360/
Posted Sat, 01 Jun 2024 02:43:27 +0000 by alkbyby

> COW filesystems can do this without device support, so we'll want to plumb this into bcachefs too

My understanding is that, to support this, the filesystem either needs to append some sort of checksum to each write like this and "waste" a full block per such write, or needs to write the data, then fsync, then write the metadata, then fsync again. So it is rather close to the double writes they're trying to optimize away in DBMSes. Either way, it costs more.

So using hardware still seems like a more attractive proposition. Unless I am missing something.

Atomic writes without tears
https://lwn.net/Articles/975154/
Posted Mon, 27 May 2024 09:46:42 +0000 by jeremyhetzler

Really enjoyed this article title.

Atomic writes without tears
https://lwn.net/Articles/975087/
Posted Sun, 26 May 2024 04:15:08 +0000 by DemiMarie

Nice! Could this also be exposed via loop devices, so that e.g. VM guests can use it too?

Atomic writes without tears
https://lwn.net/Articles/975003/
Posted Sat, 25 May 2024 00:19:10 +0000 by koverstreet

COW filesystems can do this without device support, so we'll want to plumb this into bcachefs too.
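A toy model of why that works even on devices with no atomic-write guarantee: copy-on-write never overwrites the live copy, so the data write may tear harmlessly, and a single atomic pointer update decides which version exists. This is an in-memory analogy, not bcachefs (or any filesystem) code:

```c
/* The data write goes to a slot nobody is reading; one atomic store
 * of the "metadata" pointer commits it.  A crash mid-memcpy here
 * stands in for a torn device write: the live slot is untouched. */
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define PAGE 16384

static char slots[2][PAGE];   /* old and new copies of the page */
static _Atomic int live;      /* "metadata": which slot is valid */

static void cow_write(const char *buf)
{
    int next = 1 - atomic_load(&live);

    memcpy(slots[next], buf, PAGE);  /* may tear; slot not yet live */
    atomic_store(&live, next);       /* the one atomic commit */
}

static const char *cow_read(void)
{
    return slots[atomic_load(&live)];
}

int main(void)
{
    char page[PAGE];

    memset(page, 'A', PAGE);
    cow_write(page);
    printf("first byte now: %c\n", cow_read()[0]);
    return 0;
}
```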