Support for atomic block writes in 6.13
The Linux 6.13 merge window included a pull request from VFS maintainer Christian Brauner titled "vfs untorn writes", which added the initial atomic-write capability to the kernel. In this article, we will briefly cover what these atomic writes are, why they are important in the database world, and what is currently supported in the 6.13 kernel.
To support atomic writes, changes were required across various layers of the Linux I/O stack. At the VFS level, an interface was introduced to allow applications to request atomic write I/O, along with enhancements to statx() to query atomic-write capabilities. Filesystems had to ensure that physical extent allocations were aligned to the underlying device's constraints, preventing extents from crossing atomic-write boundaries. For example, NVMe namespaces may define atomic boundaries; writes that straddle these boundaries will lose atomicity guarantees.
The block layer was updated to prevent the splitting of in-flight I/O operations for atomic write requests and to propagate the device constraints for atomic writes to higher layers. Device drivers were also modified to correctly queue atomic write requests to the hardware. Finally, the underlying disk itself must support atomic writes at the hardware level. Both NVMe and SCSI provide this feature, but in different ways; NVMe implicitly supports atomic writes for operations that remain within specified constraints, while SCSI requires a special command to ensure atomicity.
Why do databases care?
A common practice in databases is to perform disk I/O in fixed-size chunks, with 8KB and 16KB being popular I/O sizes. Databases also, however, maintain a journal that records enough information to enable recovery from a possible write error. The idea is that, if the write of new data fails, the database can take the old data present on disk as a starting point and use the information in the journal to reconstruct the new data. However, this technique is based on the assumption that the old data on disk is still consistent after the error, which may not hold if a write operation has been torn.
Tearing may happen if the I/O stack doesn't guarantee atomicity. The multi-KB write issued by the database could be split by the kernel (or the hardware) into multiple, smaller write operations. This splitting could result in a mix of old and new data being on disk after a write failure, thus leading to inconsistent on-disk data which can't be used for recovery.
To work around this possibility, databases employ an additional technique called "double write". In this approach, they first write a copy of the older data to a temporary storage area on disk and ensure that the operation completes successfully before writing to the actual on-disk tables. In case of an error in that second write operation, databases can recover by performing a journal replay on the saved copy of the older data, thus ensuring accurate recovery. But, as one might guess, these double writes come at a significant performance cost, especially for write-heavy workloads. This is the reason atomicity is sought after by databases; if the I/O stack can ensure that the chunks will never be torn, then databases can safely disable double writes without risking data corruption and, hence, can get that lost performance back.
Current state in Linux
As discussed during LSFMM+BPF 2024, some cloud vendors might already advertise atomic-write support using the ext4 filesystem with bigalloc, a feature that enables cluster-based allocation instead of per-block allocation. This helps to properly allocate aligned physical blocks (clusters) for atomic write operations. However, claiming to support atomic writes after auditing code to convince oneself that the kernel doesn't split a write request is one thing, while properly integrating atomic-write support with a well-defined user interface that guarantees atomicity is another.
With the Linux 6.13 release, the kernel provides a user interface for atomic writes using direct I/O. Although it has certain limitations (discussed later in this article), this marks an important step toward enabling database developers to explore these interfaces.
A block device's atomic-write capabilities are stored in struct queue_limits. These limits are exposed to user space via the sysfs interface at /sys/block/<device>/queue/atomic_*. The files atomic_write_unit_min and atomic_write_unit_max indicate the minimum and maximum number of bytes that can be written atomically. If these values are nonzero, the underlying block device supports atomic writes. However, hardware support alone is not sufficient; as mentioned earlier, the entire software stack, including the filesystem, block layer, and VFS, must also support atomic writes.
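For instance, a small program can read those files and report what the device advertises. The following is a minimal sketch; the device name "sdd" and the simplified error handling are assumptions for illustration:

    #include <stdio.h>

    /* Read a single numeric value from a sysfs file; returns -1 on error. */
    static long read_limit(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val = -1;

        if (!f)
            return -1;
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        /* "sdd" is an assumed device name; substitute the device of interest */
        long min = read_limit("/sys/block/sdd/queue/atomic_write_unit_min");
        long max = read_limit("/sys/block/sdd/queue/atomic_write_unit_max");

        if (max > 0)
            printf("device advertises atomic writes of %ld to %ld bytes\n", min, max);
        else
            printf("no atomic-write support advertised by the device\n");
        return 0;
    }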
How to use the atomic-write feature
Currently, atomic-write support is only enabled for a single filesystem block. Multi-block support is under development, but those operations bring additional constraints that are still being discussed in the community. To utilize the current atomic-write feature in Linux 6.13, the filesystem must be formatted with a block size that is suitable for the application's needs. A good choice is often 16KB.
Note, though, that ext4 does not support filesystem block sizes greater than the system's page size, so, on systems with a 4KB page size (such as x86), ext4 cannot use a block size of 16KB and, thus, cannot support atomic write operations of that size. XFS, on the other hand, recently gained large-block-size support, allowing it to handle block sizes greater than the page size. Note also that there is no problem with ext4 or XFS if the page size of the system itself is either 16KB or 64KB (such as on arm64 or powerpc64 systems), as both filesystems can handle block sizes less than or equal to the system's page size.
The following steps show how to make use of the atomic-write feature:
- First create a filesystem (ext4 or xfs) with a suitable block size
based on the atomic-write unit supported by the underlying block
device. For example:
    mkfs.ext4 -b 16K /dev/sdd
    mkfs.xfs -bsize=16K /dev/sdd
- Next, use the statx() system call to confirm whether atomic
writes are supported on a file by the underlying filesystem. Unlike
checking the block device sysfs path, which only indicates whether the
underlying disk supports atomic writes, statx() allows the
application to query whether it is possible to request an atomic write
operation on a file and determine the supported unit size, which also
ensures that the entire I/O stack supports atomic writes.
To facilitate atomic writes, statx() now exposes the following fields when the STATX_WRITE_ATOMIC flag is passed:
- stx_atomic_write_unit_min: Minimum size of an atomic write request.
- stx_atomic_write_unit_max: Maximum size of an atomic write request.
- stx_atomic_write_segments_max: Upper limit on the number of segments, that is, the separate memory buffers (iovecs) that can be gathered into a single write operation (the iovcnt parameter of pwritev2()). Currently, this is always set to one.
- The STATX_ATTR_WRITE_ATOMIC flag in statx->attributes is set if atomic writes are supported.
An example statx() snippet would look like the following:
    statx(AT_FDCWD, file_path, 0, STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stat_buf);
    printf("Atomic write Min: %d\n", stat_buf.stx_atomic_write_unit_min);
    printf("Atomic write Max: %d\n", stat_buf.stx_atomic_write_unit_max);
- Finally, to perform an atomic write, open the file in O_DIRECT mode and issue a pwritev2() system call with the RWF_ATOMIC flag set. Ensure that the total length of the write is a power of two that falls between atomic_write_unit_min and atomic_write_unit_max, and that the write starts at a naturally aligned offset in the file with respect to the total length of the write.
Currently, pwritev2() with RWF_ATOMIC supports only a single iovec and is limited to a single filesystem block write. This means that filesystems, when queried via statx(), report both the minimum and maximum atomic-write unit as a single filesystem block (e.g., 16KB in the example above).
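Putting these steps together, the following is a minimal sketch of a single 16KB atomic write. The file path, the hard-coded 16KB size, and the fallback definition of RWF_ATOMIC (for C libraries that do not yet provide it) are assumptions for illustration; a real application should use the sizes reported by statx() and handle errors more thoroughly:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040       /* value from the 6.13 uapi headers */
    #endif

    #define WRITE_SIZE (16 * 1024)      /* assumed to match the statx() atomic-write unit */

    int main(void)
    {
        struct iovec iov;
        void *buf;
        ssize_t ret;
        int fd;

        /* O_DIRECT requires an aligned user buffer */
        if (posix_memalign(&buf, WRITE_SIZE, WRITE_SIZE))
            return 1;
        memset(buf, 0xab, WRITE_SIZE);

        /* /mnt/scratch/file is an assumed path on an ext4 or XFS filesystem
           formatted with a 16KB block size, as in the example above */
        fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        iov.iov_base = buf;
        iov.iov_len = WRITE_SIZE;

        /* a single iovec, power-of-two length, naturally aligned offset (0) */
        ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
        if (ret != WRITE_SIZE)
            perror("pwritev2");

        close(fd);
        free(buf);
        return ret == WRITE_SIZE ? 0 : 1;
    }

If any layer cannot honor the request (a misaligned offset, an unsupported size, or a filesystem without atomic-write support), the kernel is expected to fail the write with an error rather than silently splitting it.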
The future
Kernel developers have implemented initial support for direct I/O atomic writes that are limited to a single filesystem block. However, there is ongoing work that aims to extend this support to multi-block atomic writes for both the ext4 and XFS filesystems. Despite its limitations, this feature provides a foundation for those interested in atomic-write support in Linux. It also presents an opportunity for users, such as database developers, to start exploring and experimenting with these interfaces, and to collaborate with the community on enhancing the feature, which is still under active discussion and development.
Index entries for this article
    Kernel: Atomic I/O operations
    Kernel: Block layer/Atomic operations
    GuestArticles: Harjani, Ritesh

Atomic writes should be the default... ideally
Posted Feb 20, 2025 16:15 UTC (Thu) by meven-collabora (subscriber, #168883) [Link] (1 responses)
Like the direct I/O requirement and limited file size support of the current state.
Detecting hardware support is nicely exposed through statx already.
Hopefully this will benefit also regular user and mobile and not just database servers.
Cat pictures, and account spreadsheets are precious too.
Great work regardless.

Atomic writes should be the default... ideally
Posted Feb 20, 2025 17:36 UTC (Thu) by iabervon (subscriber, #722) [Link]
The thing that's special about databases isn't that their data is more precious, it's that small, well-defined parts of the file are being changed frequently and independently, so it's necessary to modify the stored data in place, and it's feasible and worthwhile to use a file structure where individual block changes correspond to valid file states.

large atomic writes for xfs
Posted Feb 20, 2025 16:27 UTC (Thu) by garrier79 (subscriber, #171723) [Link]

Why write twice
Posted Feb 20, 2025 19:55 UTC (Thu) by jengelh (guest, #33263) [Link] (1 responses)
Why do databases (still) need a double write? E.g. git, btrfs, liblmdb, seem to do just fine with writing the data block and then (atomically) updating a pointer.

Why write twice
Posted Feb 20, 2025 22:21 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Filesystems like the NetApp filesystem, zfs, and btrfs use a phase tree approach that is not yet common for (relational) databases. And one of the reasons it is not common is because most relational databases were and are designed to give acceptable performance on spinning rust type hard disks and internal pointers in the database - in database indexes in particular - in most designs need to translate to the physical address of the referenced block without doing any additional physical disk I/O.
That means that they run more efficiently on direct mapped non-phase tree filesystems like xfs, ext4, or ntfs if not actual raw disk devices, which used to be quite common in some environments. If you put a typical relational database on a filesystem like btrfs or zfs it will slow down dramatically for that reason. It can be done of course, especially with something like zfs, but most people don't do it. That goes for Oracle, MySQL, PostgreSQL, DB2, MS SQL Server, Sybase, and a number of other older relational databases that are not so popular anymore.
If you want to design a relational or similar database to use a phase tree approach internally the place you probably ought to start is with typical B-tree or hash indexes, which are already multiversioned in most designs, and sometimes with versions that last for a considerable amount of time to do index rebuilds without taking a table offline.
And although it is usually slower it is possible to store primary key references instead of physical or logical database block references in secondary indexes and use an index organized table that basically puts the row data in the leaves of a B-tree or similar tree that is ordinarily only accessed by primary key value instead of by something like (datafile, block, row). Then of course it doesn't really matter if data blocks and rows have several versions at new file / block offsets because the database would not generally access them by file / block / row number anyway except at a very low level.
PostgreSQL might be more amenable to this because if I recall correctly Postgres table data is index organized already and old row versions are stored inline and have to be vacuumed or essentially garbage collected later. Oracle stores old row versions to allow multiversion read consistency in separate areas referred to as "rollback segments", which is one of the reasons why although it supports very long running transactions it originally had a hard time keeping up with simpler non MVCC designs like DB2, Ingres, Informix, Sybase, and MS SQL, and (of course) MySQL especially before MySQL even had automated support for transactions and ACID properties in the first place. There was a major tradeoff there for years that was usually solved by throwing lots of disk spindles at the problem, like separate disks or RAID 1 mirrored pairs for different tablespaces, data files, index segments, rollback segments, control files, and redo logs.

The real experience
Posted Feb 21, 2025 2:10 UTC (Fri) by ikm (guest, #493) [Link] (3 responses)

The real experience
Posted Feb 21, 2025 4:21 UTC (Fri) by willy (subscriber, #9762) [Link]
If you're using consumer grade hardware, then you have to either take the manufacturer's word for it or test it yourself.
Most consumer grade hardware does not advertise that it supports multi-block atomicity though, so the question doesn't arise.

The real experience
Posted Feb 21, 2025 4:24 UTC (Fri) by Thalience (subscriber, #4217) [Link]
Consumer-grade NVMe devices (that are already employing a flash-translation layer) have no reason not to do an atomic update when the spec says they should. But it no longer surprises me when exposing an API contract to wider testing also exposes some vendors/devices as doing the wrong thing for no good reason.

The real experience
Posted Feb 21, 2025 6:12 UTC (Fri) by butlerm (subscriber, #13312) [Link]
Most physical spinning rust style hard drives that I am aware of these days have 4 KB sector sizes, and most SSDs have 128 KB page sizes. In the case of the former ideally you would have 4 KB database data block sizes as well, but most databases have been using a default block size of 8192 bytes or more for some time now so they presumably account for that possibility. And a typical way that is done is to store an adequate checksum and block id of some kind inline with the database data block so that a torn write or other unusual write failure or memory corruption can be detected when the block is next processed or read back in. Some filesystems like zfs of course do something quite similar, with block checksums or secure hashes stored with all or almost all internal block pointers. SHA-256 and CRC-32c are typically supported with CPU instructions on most modern enterprise class hardware so that isn't too difficult or too slow.

How does this work at a physical level?
Posted Feb 23, 2025 2:37 UTC (Sun) by KJ7RRV (subscriber, #153595) [Link] (8 responses)

How does this work at a physical level?
Posted Feb 23, 2025 4:35 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

How does this work at a physical level?
Posted Feb 23, 2025 7:25 UTC (Sun) by KJ7RRV (subscriber, #153595) [Link] (4 responses)

How does this work at a physical level?
Posted Feb 23, 2025 13:01 UTC (Sun) by farnz (subscriber, #17727) [Link] (2 responses)
It doesn't need a capacitor, necessarily. There's two routes you can take in an SSD to enable this feature:
Capacitor is more likely, because it lets you have a write cache, too, and thus a performance advantage in enterprise drives. But it's possible without one in an SSD.

How does this work at a physical level?
Posted Feb 24, 2025 1:32 UTC (Mon) by Paf (subscriber, #91811) [Link] (1 responses)
I like that this is basically double writes - sometimes, it's turtles all the way down.

How does this work at a physical level?
Posted Mar 3, 2025 17:09 UTC (Mon) by mebrown (subscriber, #7960) [Link]
The new data is written to a new block, then the mapping table entry is atomically switched so that the old data is unmapped/freed and the new data is swapped in.

How does this work at a physical level?
Posted Feb 23, 2025 13:50 UTC (Sun) by kleptog (subscriber, #1183) [Link]

How does this work at a physical level?
Posted Mar 6, 2025 8:32 UTC (Thu) by ras (subscriber, #33059) [Link] (1 responses)

How does this work at a physical level?
Posted Mar 18, 2025 18:20 UTC (Tue) by sammythesnake (guest, #17693) [Link]