Atomic writes without tears
Posted Jun 3, 2024 18:19 UTC (Mon) by andresfreund (subscriber, #69562)
FWIW, with the default compilation options postgres only supports 8KB pages. With non-default options, 1, 2, 4, 8, 16, or 32kB are supported for data pages (1-64kB for WAL, but we don't need torn-page protection there).
> The reason for caring about buffered I/O is because PostgreSQL uses it; depending on who you talk to, it will be three to six years before the database can switch to using direct I/O.
FWIW, if it were defined to be OK to use unbuffered I/O for writes and buffered I/O for reads, the timelines could be shortened considerably. There would be no concurrent reads and writes to the same pages, due to postgres-level interlocking.
Realistically, there are workloads where we'll likely never be able to switch to unbuffered reads. Decent performance with unbuffered reads requires a well-configured buffer-pool size, which
a) a lot of users won't configure
b) doesn't work well when putting a lot of smaller postgres instances onto one system, which is common
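To make the scheme above concrete, here is a minimal sketch of what "unbuffered writes, buffered reads" could look like: one O_DIRECT descriptor for writes and an ordinary descriptor for reads on the same file. The file name, page number, and 8KB page size are placeholders, and the whole point of the comment is that the kernel semantics of mixing the two this way would first have to be defined as safe; the application is assumed to serialize access to overlapping pages itself, as postgres does with its own locking.

```c
/* Sketch: direct-I/O writes combined with buffered reads on the same file.
 * Assumes the application guarantees (as postgres does via its own
 * interlocking) that a page is never read and written concurrently.
 * File name, page number, and page size are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 8192

int main(void)
{
	/* One descriptor for unbuffered (direct) writes ... */
	int wfd = open("relation.dat", O_WRONLY | O_DIRECT);
	/* ... and a separate, ordinary descriptor for buffered reads. */
	int rfd = open("relation.dat", O_RDONLY);
	if (wfd < 0 || rfd < 0)
		return 1;

	/* O_DIRECT requires the buffer, offset, and length to be aligned. */
	void *page;
	if (posix_memalign(&page, PAGE_SIZE, PAGE_SIZE) != 0)
		return 1;
	memset(page, 0, PAGE_SIZE);

	off_t page_no = 3;	/* arbitrary example page */
	if (pwrite(wfd, page, PAGE_SIZE, page_no * PAGE_SIZE) != PAGE_SIZE)
		return 1;

	/* Reads of other pages still go through the page cache as before. */
	char buf[PAGE_SIZE];
	if (pread(rfd, buf, sizeof(buf), 0) < 0)
		return 1;

	free(page);
	close(wfd);
	close(rfd);
	return 0;
}
```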
> An attendee said that with buffered I/O, there is no way for the application to get an error if the write fails. Ts'o said that any error would come when fdatasync() is called, which the attendee called "buffered in name only".
I don't really understand the "buffered in name only" comment.
> Kara asked about the impact of these changes on the database code.
From what I can tell, it'd not be particularly hard for postgres. There's some nastiness around upgrading a database that wasn't created using FORCEALIGN, and thus can't guarantee that all existing data files are aligned. To deal with that, it'd be nice if we could specify on a per-write basis that we want an atomic write - if we get an error because a file isn't suitably aligned, we can fall back to the current method of protecting against torn writes.
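A rough sketch of that per-write fallback, assuming a per-write flag along the lines of the RWF_ATOMIC proposal discussed in the article: try the untorn write first and, if the file (or kernel) can't provide it, fall back to the existing full-page-write protection. The helper names are hypothetical and the errno chosen for "not suitably aligned" is an assumption.

```c
/* Sketch: request an atomic write per-call, falling back to today's
 * torn-write protection if the file can't support it. RWF_ATOMIC is the
 * flag from the atomic-write patch set discussed in the article; the
 * fallback helper is hypothetical. */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* value from the proposed kernel UAPI */
#endif

/* Hypothetical existing code path: WAL-log a full page image first. */
extern void log_full_page_image(const void *page, off_t offset);

static ssize_t write_page(int fd, const void *page, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = (void *) page, .iov_len = len };

	/* Ask for an untorn write of this one page. */
	ssize_t rc = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
	if (rc >= 0)
		return rc;

	/*
	 * An EINVAL-style error would indicate the file isn't suitably
	 * aligned (e.g. it predates FORCEALIGN); fall back to protecting
	 * against torn writes the way we do today.
	 */
	if (errno == EINVAL) {
		log_full_page_image(page, offset);
		return pwritev2(fd, &iov, 1, offset, 0);
	}
	return rc;
}
```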
> Ts'o said that he believes the PostgreSQL developers are looking forward to a solution, so they are willing to make the required changes when the time comes.
Indeed. Our torn-write protection often ends up being the majority of the WAL volume.
There's a small caveat in that we currently use full page writes to increase replay performance (we don't need to read the underlying data if we have the full new contents, thus the number of ~random reads can be drastically reduced).
> They are likely to backport those changes to earlier versions of PostgreSQL as well.
I doubt we'd backport this to earlier PG versions - we tend to be very conservative with backporting features. With a very limited set of exceptions (e.g. to support easier/more efficient upgrades) we only backport bugfixes.
> Wilcox said that probably did not matter, because the older versions of PostgreSQL are running on neolithic kernels. Ts'o said that is not necessarily the case since some customers are willing to upgrade their kernel, but require sticking with the older database system.
Yea, that unfortunately is indeed common. It's not helped by the PG versions included in LTS distros being quite old. Lots of users get started on those and continue to use the same PG version even when upgrading the OS.
