Support for atomic block writes in 6.13
The Linux 6.13 merge window included a pull request from VFS maintainer Christian Brauner titled "vfs untorn writes", which added the initial atomic-write capability to the kernel. In this article, we will briefly cover what these atomic writes are, why they are important in the database world, and what is currently supported in the 6.13 kernel.
To support atomic writes, changes were required across various layers of the Linux I/O stack. At the VFS level, an interface was introduced to allow applications to request atomic write I/O, along with enhancements to statx() to query atomic-write capabilities. Filesystems had to ensure that physical extent allocations were aligned to the underlying device's constraints, preventing extents from crossing atomic-write boundaries. For example, NVMe namespaces may define atomic boundaries; writes that straddle these boundaries will lose atomicity guarantees.
The block layer was updated to prevent the splitting of in-flight I/O operations for atomic write requests and to propagate the device constraints for atomic writes to higher layers. Device drivers were also modified to correctly queue atomic write requests to the hardware. Finally, the underlying disk itself must support atomic writes at the hardware level. Both NVMe and SCSI provide this feature, but in different ways; NVMe implicitly supports atomic writes for operations that remain within specified constraints, while SCSI requires a special command to ensure atomicity.
Why do databases care?
A common practice in databases is to perform disk I/O in fixed-size chunks, with 8KB and 16KB being popular I/O sizes. Databases also, however, maintain a journal that records enough information to enable recovery from a possible write error. The idea is that, if the write of new data fails, the database can take the old data present on disk as a starting point and use the information in the journal to reconstruct the new data. However, this technique is based on the assumption that the old data on disk is still consistent after the error, which may not hold if a write operation has been torn.
Tearing may happen if the I/O stack doesn't guarantee atomicity. The multi-KB write issued by the database could be split by the kernel (or the hardware) into multiple, smaller write operations. This splitting could result in a mix of old and new data being on disk after a write failure, thus leading to inconsistent on-disk data which can't be used for recovery.
To work around this possibility, databases employ an additional technique called "double write". In this approach, they first write a copy of the older data to a temporary storage area on disk and ensure that the operation completes successfully before writing to the actual on-disk tables. In case of an error in that second write operation, databases can recover by performing a journal replay on the saved copy of the older data, thus ensuring accurate recovery. But, as one might guess, these double writes come at a significant performance cost, especially for write-heavy workloads. This is the reason atomicity is sought after by databases; if the I/O stack can ensure that the chunks will never be torn, then databases can safely disable double writes without risking data corruption and, hence, can get that lost performance back.
Current state in Linux
As discussed during LSFMM+BPF 2024, some cloud vendors might already advertise atomic-write support using the ext4 filesystem with bigalloc, a feature that enables cluster-based allocation instead of per-block allocation. This helps to properly allocate aligned physical blocks (clusters) for atomic write operations. However, claiming to support atomic writes after auditing code to convince oneself that the kernel doesn't split a write request is one thing, while properly integrating atomic-write support with a well-defined user interface that guarantees atomicity is another.
With the Linux 6.13 release, the kernel provides a user interface for atomic writes using direct I/O. Although it has certain limitations (discussed later in this article), this marks an important step toward enabling database developers to explore these interfaces.
A block device's atomic-write capabilities are stored in struct queue_limits. These limits are exposed to user space via the sysfs interface at /sys/block/<device>/queue/atomic_*. The files atomic_write_unit_min and atomic_write_unit_max indicate the minimum and maximum number of bytes that can be written atomically. If these values are nonzero, the underlying block device supports atomic writes. However, hardware support alone is not sufficient; as mentioned earlier, the entire software stack, including the filesystem, block layer, and VFS, must also support atomic writes.
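For instance, a small program can read those files and report what the device advertises. The following is a minimal sketch; the device name "sdd" and the simplified error handling are assumptions for illustration:

    #include <stdio.h>

    /* Read a single numeric value from a sysfs file; returns -1 on error. */
    static long read_limit(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val = -1;

        if (!f)
            return -1;
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        /* "sdd" is an assumed device name; substitute the device of interest */
        long min = read_limit("/sys/block/sdd/queue/atomic_write_unit_min");
        long max = read_limit("/sys/block/sdd/queue/atomic_write_unit_max");

        if (max > 0)
            printf("device advertises atomic writes of %ld to %ld bytes\n", min, max);
        else
            printf("no atomic-write support advertised by the device\n");
        return 0;
    }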
How to use the atomic-write feature
Currently, atomic-write support is only enabled for a single filesystem block. Multi-block support is under development, but those operations bring additional constraints that are still being discussed in the community. To utilize the current atomic-write feature in Linux 6.13, the filesystem must be formatted with a block size that is suitable for the application's needs. A good choice is often 16KB.
Note, though, that ext4 does not support filesystem block sizes greater than the system's page size, so, on systems with a 4KB page size (such as x86), ext4 cannot use a block size of 16KB and, thus, cannot support atomic write operations of that size. XFS, on the other hand, recently gained large-block-size support, allowing it to handle block sizes greater than the page size. Note also that there is no problem with ext4 or XFS if the page size of the system itself is either 16KB or 64KB (such as on arm64 or powerpc64 systems), as both filesystems can handle block sizes less than or equal to the system's page size.
The following steps show how to make use of the atomic-write feature:
- First create a filesystem (ext4 or xfs) with a suitable block size
based on the atomic-write unit supported by the underlying block
device. For example:
    mkfs.ext4 -b 16K /dev/sdd
    mkfs.xfs -bsize=16K /dev/sdd
- Next, use the statx() system call to confirm whether atomic
writes are supported on a file by the underlying filesystem. Unlike
checking the block device sysfs path, which only indicates whether the
underlying disk supports atomic writes, statx() allows the
application to query whether it is possible to request an atomic write
operation on a file and determine the supported unit size, which also
ensures that the entire I/O stack supports atomic writes.
To facilitate atomic writes, statx() now exposes the following fields when the STATX_WRITE_ATOMIC flag is passed:
- stx_atomic_write_unit_min: Minimum size of an atomic write request.
- stx_atomic_write_unit_max: Maximum size of an atomic write request.
- stx_atomic_write_segments_max: Upper limit on the number of segments, that is, the separate memory buffers (iovecs) that can be gathered into a single write operation (the iovcnt parameter of pwritev2()). Currently, this is always set to one.
- The STATX_ATTR_WRITE_ATOMIC flag in statx->attributes is set if atomic writes are supported.
An example statx() snippet would look like the following:
    statx(AT_FDCWD, file_path, 0, STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stat_buf);
    printf("Atomic write Min: %d\n", stat_buf.stx_atomic_write_unit_min);
    printf("Atomic write Max: %d\n", stat_buf.stx_atomic_write_unit_max);
- Finally, to perform an atomic write, open the file in O_DIRECT mode and issue a pwritev2() system call with the RWF_ATOMIC flag set. Ensure that the total length of the write is a power of two that falls between atomic_write_unit_min and atomic_write_unit_max, and that the write starts at a naturally aligned offset in the file with respect to the total length of the write.
Currently, pwritev2() with RWF_ATOMIC supports only a single iovec and is limited to a single filesystem block write. This means that filesystems, when queried via statx(), report both the minimum and maximum atomic-write unit as a single filesystem block (e.g., 16KB in the example above).
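Putting these steps together, the following is a minimal sketch of a single 16KB atomic write. The file path, the hard-coded 16KB size, and the fallback definition of RWF_ATOMIC (for C libraries that do not yet provide it) are assumptions for illustration; a real application should use the sizes reported by statx() and handle errors more thoroughly:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040       /* value from the 6.13 uapi headers */
    #endif

    #define WRITE_SIZE (16 * 1024)      /* assumed to match the statx() atomic-write unit */

    int main(void)
    {
        struct iovec iov;
        void *buf;
        ssize_t ret;
        int fd;

        /* O_DIRECT requires an aligned user buffer */
        if (posix_memalign(&buf, WRITE_SIZE, WRITE_SIZE))
            return 1;
        memset(buf, 0xab, WRITE_SIZE);

        /* /mnt/scratch/file is an assumed path on an ext4 or XFS filesystem
           formatted with a 16KB block size, as in the example above */
        fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        iov.iov_base = buf;
        iov.iov_len = WRITE_SIZE;

        /* a single iovec, power-of-two length, naturally aligned offset (0) */
        ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
        if (ret != WRITE_SIZE)
            perror("pwritev2");

        close(fd);
        free(buf);
        return ret == WRITE_SIZE ? 0 : 1;
    }

If any layer cannot honor the request (a misaligned offset, an unsupported size, or a filesystem without atomic-write support), the kernel is expected to fail the write with an error rather than silently splitting it.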
The future
Kernel developers have implemented initial support for direct I/O atomic writes that are limited to a single filesystem block. However, there is ongoing work that aims to extend this support to multi-block atomic writes for both the ext4 and XFS filesystems. Despite its limitations, this feature provides a foundation for those interested in atomic-write support in Linux. It also presents an opportunity for users, such as database developers, to start exploring and experimenting with these interfaces, and to collaborate with the community on enhancing the feature, which is still under active discussion and development.
Index entries for this article
    Kernel: Atomic I/O operations
    Kernel: Block layer/Atomic operations
    GuestArticles: Harjani, Ritesh

Atomic writes should be the default... ideally
Posted Feb 20, 2025 16:15 UTC (Thu) by meven-collabora (subscriber, #168883) [Link] (1 responses)
Like the direct I/O requirement and limited file size support of the current state.
Detecting hardware support is nicely exposed through statx already.
Hopefully this will benefit also regular user and mobile and not just database servers.
Cat pictures, and account spreadsheets are precious too.
Great work regardless.

Atomic writes should be the default... ideally
Posted Feb 20, 2025 17:36 UTC (Thu) by iabervon (subscriber, #722) [Link]
The thing that's special about databases isn't that their data is more precious, it's that small, well-defined parts of the file are being changed frequently and independently, so it's necessary to modify the stored data in place, and it's feasible and worthwhile to use a file structure where individual block changes correspond to valid file states.

large atomic writes for xfs
Posted Feb 20, 2025 16:27 UTC (Thu) by garrier79 (subscriber, #171723) [Link]

Why write twice
Posted Feb 20, 2025 19:55 UTC (Thu) by jengelh (guest, #33263) [Link] (1 responses)
Why do databases (still) need a double write? E.g. git, btrfs, liblmdb, seem to do just fine with writing the data block and then (atomically) updating a pointer.

Why write twice
Posted Feb 20, 2025 22:21 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Filesystems like the NetApp filesystem, zfs, and btrfs use a phase tree approach that is not yet common for (relational) databases. And one of the reasons it is not common is because most relational databases were and are designed to give acceptable performance on spinning rust type hard disks and internal pointers in the database - in database indexes in particular - in most designs need to translate to the physical address of the referenced block without doing any additional physical disk I/O.
That means that they run more efficiently on direct mapped non-phase tree filesystems like xfs, ext4, or ntfs if not actual raw disk devices, which used to be quite common in some environments. If you put a typical relational database on a filesystem like btrfs or zfs it will slow down dramatically for that reason. It can be done of course, especially with something like zfs, but most people don't do it. That goes for Oracle, MySQL, PostgreSQL, DB2, MS SQL Server, Sybase, and a number of other older relational databases that are not so popular anymore.
If you want to design a relational or similar database to use a phase tree approach internally the place you probably ought to start is with typical B-tree or hash indexes, which are already multiversioned in most designs, and sometimes with versions that last for a considerable amount of time to do index rebuilds without taking a table offline.
And although it is usually slower it is possible to store primary key references instead of physical or logical database block references in secondary indexes and use an index organized table that basically puts the row data in the leaves of a B-tree or similar tree that is ordinarily only accessed by primary key value instead of by something like (datafile, block, row). Then of course it doesn't really matter if data blocks and rows have several versions at new file / block offsets because the database would not generally access them by file / block / row number anyway except at a very low level.
PostgreSQL might be more amenable to this because if I recall correctly Postgres table data is index organized already and old row versions are stored inline and have to be vacuumed or essentially garbage collected later. Oracle stores old row versions to allow multiversion read consistency in separate areas referred to as "rollback segments", which is one of the reasons why although it supports very long running transactions it originally had a hard time keeping up with simpler non MVCC designs like DB2, Ingres, Informix, Sybase, and MS SQL, and (of course) MySQL especially before MySQL even had automated support for transactions and ACID properties in the first place. There was a major tradeoff there for years that was usually solved by throwing lots of disk spindles at the problem, like separate disks or RAID 1 mirrored pairs for different tablespaces, data files, index segments, rollback segments, control files, and redo logs.

The real experience
Posted Feb 21, 2025 2:10 UTC (Fri) by ikm (guest, #493) [Link] (3 responses)

The real experience
Posted Feb 21, 2025 4:21 UTC (Fri) by willy (subscriber, #9762) [Link]
If you're using consumer grade hardware, then you have to either take the manufacturer's word for it or test it yourself.
Most consumer grade hardware does not advertise that it supports multi-block atomicity though, so the question doesn't arise.

The real experience
Posted Feb 21, 2025 4:24 UTC (Fri) by Thalience (subscriber, #4217) [Link]
Consumer-grade NVMe devices (that are already employing a flash-translation layer) have no reason not to do an atomic update when the spec says they should. But it no longer surprises me when exposing an API contract to wider testing also exposes some vendors/devices as doing the wrong thing for no good reason.

The real experience
Posted Feb 21, 2025 6:12 UTC (Fri) by butlerm (subscriber, #13312) [Link]
Most physical spinning rust style hard drives that I am aware of these days have 4 KB sector sizes, and most SSDs have 128 KB page sizes. In the case of the former ideally you would have 4 KB database data block sizes as well, but most databases have been using a default block size of 8192 bytes or more for some time now so they presumably account for that possibility. And a typical way that is done is to store an adequate checksum and block id of some kind inline with the database data block so that a torn write or other unusual write failure or memory corruption can be detected when the block is next processed or read back in. Some filesystems like zfs of course do something quite similar, with block checksums or secure hashes stored with all or almost all internal block pointers. SHA-256 and CRC-32c are typically supported with CPU instructions on most modern enterprise class hardware so that isn't too difficult or too slow.

How does this work at a physical level?
Posted Feb 23, 2025 2:37 UTC (Sun) by KJ7RRV (subscriber, #153595) [Link] (8 responses)

How does this work at a physical level?
Posted Feb 23, 2025 4:35 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

How does this work at a physical level?
Posted Feb 23, 2025 7:25 UTC (Sun) by KJ7RRV (subscriber, #153595) [Link] (4 responses)

How does this work at a physical level?
Posted Feb 23, 2025 13:01 UTC (Sun) by farnz (subscriber, #17727) [Link] (2 responses)
It doesn't need a capacitor, necessarily. There's two routes you can take in an SSD to enable this feature:
Capacitor is more likely, because it lets you have a write cache, too, and thus a performance advantage in enterprise drives. But it's possible without one in an SSD.

How does this work at a physical level?
Posted Feb 24, 2025 1:32 UTC (Mon) by Paf (subscriber, #91811) [Link] (1 responses)
I like that this is basically double writes - sometimes, it's turtles all the way down.

How does this work at a physical level?
Posted Mar 3, 2025 17:09 UTC (Mon) by mebrown (subscriber, #7960) [Link]
The new data is written to a new block, then the mapping table entry is atomically switched so that the old data is unmapped/freed and the new data is swapped in.

How does this work at a physical level?
Posted Feb 23, 2025 13:50 UTC (Sun) by kleptog (subscriber, #1183) [Link]

How does this work at a physical level?
Posted Mar 6, 2025 8:32 UTC (Thu) by ras (subscriber, #33059) [Link] (1 responses)

How does this work at a physical level?
Posted Mar 18, 2025 18:20 UTC (Tue) by sammythesnake (guest, #17693) [Link]