
Postgres, FPW=off and DIO


Posted Apr 4, 2025 14:09 UTC (Fri) by andresfreund (subscriber, #69562)
Parent article: Supporting untorn buffered writes

Hi,

Are the fpw=off Postgres numbers actually using RWF_ATOMIC somehow? If not, the performance comparison seems fairly meaningless: all it would be measuring is whether the increased WAL volume has a performance impact, which it obviously does.

RWF_ATOMIC does come with some overhead, AFAIU (on lots of devices FUA writes are slower).

FWIW, we (PG) finally merged AIO support recently, albeit, for the next release, only with read support. We just ran out of time to solve some of the corner cases for asynchronous writes; I'm fairly certain that we can get write support done for the next version. While we added readahead in a lot of places in 17 and now in 18, there are still some important ones missing; that's the other big missing piece.

You can already enable direct IO, albeit for now via a flag intentionally named to scare folks away (debug_io_direct=data).
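For the curious, enabling it looks like the fragment below. The debug_io_direct setting is the one named in the comment; the shared_buffers value is an illustrative assumption, included only because (as noted next) direct IO shifts all caching responsibility onto Postgres's own buffer pool.

```ini
# postgresql.conf -- illustrative sketch, not a recommendation
debug_io_direct = 'data'   # bypass the kernel page cache for data files
shared_buffers = '16GB'    # hypothetical sizing: with DIO, the buffer pool
                           # must absorb the caching the kernel used to do
```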

However, I'm fairly certain that even once we fully support direct IO, a large number of folks will not be able to use it. Using DIO sanely requires a much larger buffer pool than buffered IO does. That's not viable in the many cases where Postgres runs on hardware shared with other software or with other Postgres instances, which is unfortunately common.

With the WIP AIO write support we have seen rather significant performance wins from writing multiple buffers at once when possible (i.e. write-combining up to 32 neighboring 8kB buffers into one vectored write). My understanding is that that won't realistically be compatible with the current RWF_ATOMIC semantics. We can probably decide whether to write-combine based on whether torn-page protection is needed, but it'll be a painful tradeoff.



Postgres, FPW=off and DIO

Posted Apr 4, 2025 14:53 UTC (Fri) by mcgrof (subscriber, #25917) [Link]

We have no semantics defined today for buffered IO with RWF_ATOMIC, so it can't be evaluated directly. At this stage the goal was to build kernel-community appreciation of its potential and to discuss possible kernel-level filesystem and block semantics. Now that there seems to be better appreciation of its potential on the kernel side, and the possible kernel semantics have been discussed, the next goal would be to tailor a use case for databases that could leverage it, such as PostgreSQL, and for that it's best to collaborate with the db community.

It's also correct that the RWF_ATOMIC semantics today require single writes. That's not because of the requirements of direct IO, but rather because, at least from an NVMe perspective, a write must not cross a boundary size: if that boundary is 16k, an atomic write cannot be larger than 16k. It's a hardware requirement, so software must tailor atomic writes to the hardware's needs, and the goal of RWF_ATOMIC is to help facilitate those requirements. NVMe MAM could in theory enable large-IO RWF_ATOMIC, but the wrinkle is that it only works if the large write succeeds. If a large NVMe atomic MAM write fails, filesystems on Linux today have no way of telling which block was valid and which is incorrect; the atomic block that failed is not communicated back. So the entire range would need to be invalidated, which defeats the purpose. Reflinks *may* help work around that limitation, but that would require some evaluation and development.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds