A way to do atomic writes

Posted May 20, 2021 20:33 UTC (Thu) by aist (guest, #51495)
Parent article: A way to do atomic writes

Atomic operations on files don't make much sense right now, because they imply two things:

1. A model of concurrency.
2. A system of relaxations of atomicity, to support different consistency/performance tradeoffs.

Atomicity is not cheap and, more importantly, it is not (easily) composable. Because of that, pushing high-level semantics down to the hardware (disks) will not work as expected. Elementary atomic operations (1-4 blocks) are more than enough to support high-level composable atomic semantics across many disks. BUT it is very hard to have high-level (generic) atomic writes that are also parallel. Relational databases appear to provide fairly concurrent atomicity, but they rely on the fact that relational tables are unordered (sets of records), so it is relatively easy to merge multiple parallel versions of the same table into a single one. Merging ordered data structures like vectors (and files) is not defined in the general case (it is application-defined).
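To illustrate the point about merging, here is a small Python sketch. The function name `merge_unordered` and the record labels are made up for the example, and it deliberately ignores conflicting updates to the same record. It only shows that a three-way merge of sets has one obvious answer, while two inserts at the same position in a sequence do not.

```python
def merge_unordered(base, a, b):
    """Three-way merge of two versions of a set of records.

    Hypothetical helper: applies the additions and removals that
    each writer made relative to the common base version.
    """
    added = (a - base) | (b - base)
    removed = (base - a) | (base - b)
    return (base | added) - removed


base = {"r1", "r2"}
writer_a = {"r1", "r2", "r3"}   # writer A added r3
writer_b = {"r2", "r4"}         # writer B removed r1 and added r4

merged = merge_unordered(base, writer_a, writer_b)
# The result does not depend on which writer is considered first.
same = merge_unordered(base, writer_b, writer_a)

# Ordered case: both writers insert one element at index 1.
base_seq = ["x", "y"]
seq_a = ["x", "A", "y"]
seq_b = ["x", "B", "y"]
# Both of these are defensible merges; only the application can choose.
candidates = [["x", "A", "B", "y"], ["x", "B", "A", "y"]]
```

For the set case both orders of merging give the same table; for the sequence case nothing in the data says which candidate is right.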

There are pretty good single-writer-multiple-readers (SWMR) schemes that are lightweight, log-structured, wait-free, and atomic for readers and writers (see LMDB for an example), but they are inherently single-writer: only one writer at a time can own the domain of atomicity (a single file, a directory, a file system). Readers, though, are fully concurrent with each other and with the writer. SWMR is best suited to dynamic data analytics applications because of its point-in-time semantics for readers (stable cursors, etc.).
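A toy in-memory sketch of the SWMR idea, in Python. This is not how LMDB is implemented (LMDB uses copy-on-write B+trees over a memory-mapped file; here the whole dict is copied on every commit), and the class name `SwmrStore` is invented for the example. It only shows the shape of the scheme: one writer at a time builds a new version and publishes it by swapping a single reference, and readers keep whatever version they grabbed.

```python
import threading
from types import MappingProxyType


class SwmrStore:
    """Hypothetical single-writer, multiple-reader store."""

    def __init__(self):
        self._root = {}                      # committed version, never mutated in place
        self._writer_lock = threading.Lock() # serializes writers only

    def snapshot(self):
        # Readers take no lock: they grab the current root reference and
        # get a read-only, point-in-time view of that version.
        return MappingProxyType(self._root)

    def commit(self, updates):
        with self._writer_lock:
            new_root = dict(self._root)      # copy-on-write (whole-map copy here)
            new_root.update(updates)
            self._root = new_root            # publish by swapping one reference


store = SwmrStore()
store.commit({"a": 1})
snap = store.snapshot()          # reader's stable view
store.commit({"a": 2, "b": 3})   # does not disturb `snap`
```

The old version stays alive for as long as some reader holds it; in this sketch Python's own reference counting plays the role of the garbage collector.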

Multiple concurrent atomic writes (MWMR) are possible, but they are not as wait-free as SWMR, have much higher operational overhead, and require atomic counters for (deterministic) garbage collection. Write-write conflict resolution is also application-defined. So, if we want an MWMR engine implemented at the device level, it will require pretty tight integration with applications, implying pretty complex APIs. Simply put, it isn't worth the effort.

Log-structured SWMR/MWMR can work well with single-block-scale atomic operations; they just need certain power-failure guarantees. They can be implemented entirely in userspace as services on top of asynchronous I/O interfaces like io_uring. Partial emulation of the POSIX file API for legacy applications accessing the "atomic data" is also possible via FUSE.
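A rough userspace sketch of that, in Python with plain synchronous `os.pwrite`/`os.fsync` instead of io_uring to keep it short (Unix only). The layout is an assumption made for this example: two 512-byte meta slots at the start of the file, followed by an append-only data area. It also assumes that a single-block write is either fully applied or detectably torn (hence the CRC), and that `fsync` really does reach stable storage. Those are exactly the power-failure guarantees mentioned above.

```python
import os
import struct
import zlib

BLOCK = 512                       # assumed atomic write unit
HEADER = struct.Struct("<QQI")    # sequence number, committed length, CRC32


def _pack_meta(seq, length):
    body = struct.pack("<QQ", seq, length)
    return HEADER.pack(seq, length, zlib.crc32(body)).ljust(BLOCK, b"\0")


def _read_meta(fd, slot):
    raw = os.pread(fd, BLOCK, slot * BLOCK)
    if len(raw) < HEADER.size:
        return None
    seq, length, crc = HEADER.unpack_from(raw)
    if zlib.crc32(raw[:16]) != crc:
        return None               # empty or torn slot
    return seq, length


def current_meta(fd):
    """Return (seq, length) of the newest valid commit, or (0, 0)."""
    metas = [m for m in (_read_meta(fd, 0), _read_meta(fd, 1)) if m]
    return max(metas, default=(0, 0))


def commit(fd, payload):
    """Single-writer commit: append data, then flip one meta block."""
    seq, length = current_meta(fd)
    os.pwrite(fd, payload, 2 * BLOCK + length)   # 1. append past the committed end
    os.fsync(fd)                                 # 2. data durable before the pointer
    seq += 1
    os.pwrite(fd, _pack_meta(seq, length + len(payload)), (seq % 2) * BLOCK)
    os.fsync(fd)                                 # 3. single-block commit record


def read_committed(fd):
    """Readers see only the bytes covered by the newest valid meta block."""
    _, length = current_meta(fd)
    return os.pread(fd, length, 2 * BLOCK)
```

A crash before step 3 leaves the old meta block as the newest valid one, so the appended bytes are simply ignored; alternating between two slots means a torn meta write never destroys the previous commit.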

Adding complex high-level atomic semantics (especially multi-operation commits) to the POSIX API will create many more problems than atomics are intended to solve.