A way to do atomic writes

By Jake Edge
May 28, 2019

Finding a way for applications to do atomic writes to files, so that either the old or new data is present after a crash and not a combination of the two, was the topic of a session led by Christoph Hellwig at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). Application developers hate the fact that when they update files in place, a crash can leave them with old or new data—or sometimes a combination of both. He discussed some implementation ideas that he has for atomic writes for XFS and wanted to see what the other filesystem developers thought about it.

Currently, when applications want to do an atomic write, they do one of two things. Either they use "weird user-space locking schemes", as databases typically do, or they write an entirely new file, then do an "atomic rename trick" to ensure the data is in place. Unfortunately, the applications often do not use fsync() correctly, so they lose their data anyway.

In modern storage systems, the devices themselves sometimes do writes that are not in-place writes. Flash devices have a flash translation layer (FTL) that remaps writes to different parts of the flash for wear leveling, so those never actually do in-place updates. For NVMe devices, an update of one logical-block address (LBA) is guaranteed to be atomic but the interface is awkward so he is not sure if anyone is really using it. SCSI has a nice interface, with good error reporting, for writing atomically, but he has not seen a single device that implements it.

There are filesystems that can write out-of-place, such as XFS, Btrfs, and others, so it would be nice to allow for atomic writes at the filesystem layer. He said that nearly five years ago there was an interesting paper from HP Research that reported results of adding a special open() flag to indicate that atomic writes were desired. It was an academic paper that didn't deal with some of the corner cases and limitations, but had some reasonable ideas.

In that system, users can write as much data as they want to a file, but nothing will be visible until they do an explicit commit operation. Once that commit is done, all of the changes become active. One simple way to implement this would be to handle the commit operation as part of fsync(), which means that no new system call is required.

A while back, he started implementing atomic writes using this scheme in XFS. He posted some patches, but there were multiple problems there; he has since reworked that patch set. Now the authors of the paper are "pestering him" to get the code out so that they can write another paper about it with him. Others have also asked for the feature, he said.

Chris Mason asked what the granularity is; is it just a single write() call or more than that? Hellwig said that it is all of the writes that happen until the commit operation is performed. Filesystems can establish an upper bound on the amount of data that can be handled; for XFS it is based on the number of discontiguous regions (i.e. extents) that the writes touch.

This feature would work for mmap() regions as well, not just traditional write() calls. For example, Hellwig noted that it is difficult to do an atomic update of a, say, B-Tree that updates multiple nodes. With this feature, the application can just make the changes in the file-backed memory, then do the commit; if there is a crash, they will end up with one version or the other.

Ted Ts'o said that he found it amusing because someone he is advising on the Android team wants a similar feature, but wants it on a per-filesystem basis. The idea is that, when updating Android from one version to another, the ext4 or F2FS filesystem would be mounted with a magic option that would stop any journal commits from happening. An ioctl() command would then be sent once the update has finished and the journal commits would be processed. It is "kind of ugly", he said, but it gives him perhaps 90% of what would be needed to implement the atomic write feature. Toward the end of the session, Ts'o said that he believes ext4 will get the atomic write feature as well, though it will be more limited in terms of how much of the file can be updated prior to a commit.

Hellwig expressed some skepticism, noting that he had tried to do something similar by handling the updates in memory, but that became restrictive in terms of the amount of update data that could be handled. Ts'o said that for Android, the data blocks are being written to the disk, it is just the metadata updates that are being held for the few minutes required to do the update. It is a "very restrictive use case", Ts'o said, but the new mechanism replaces a device-mapper hack that was far too slow.

Chris Mason said that, depending on the interface, he would be happy to see Btrfs support it. Hellwig said that it should be fairly straightforward to do in Btrfs. One of the big blockers for him at this point is the interaction with O_DIRECT. If an application writes data atomically, then reads it back, it better get what it just wrote; no "sane application" would do that, he said, but NFS does. The Linux I/O path is not really set up to handle that, so he has some work to do there.

There was some discussion of using fsync() instead of a dedicated system call or other interface. Hellwig sees no reason not to use fsync() since it has much the same meaning; there is no reason to do one operation without the other, he said. Amir Goldstein asked about the possibility of another process using an fsync() on the file as a kind of attack.

Hellwig said that originally he was using an open() flag, but got reminded again that unused flags are not checked by open() so using a flag for data integrity is not really a good idea. Under that model, though, an fsync() would only map to the commit operation for file descriptors that had been opened with the flag. He has switched to an inode flag, which makes more sense in some ways, but it does leave open the problem of unwanted fsync() calls.

Index entries for this article
Kernel	Atomic I/O operations
Kernel	Block layer/Atomic operations
Conference	Storage, Filesystem, and Memory-Management Summit/2019

A way to do atomic writes

Posted May 29, 2019 3:53 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (13 responses)

It would be nice to keep atomicity separate from durability, i.e. don't make fsync() the atomic commit primitive, unless it also works with an asynchronous variant of fsync(). A lot of applications don't care if you lose the last few updates, as long as the application never sees the results of a partial update. In database terms, this is asynchronous commit, which trades durability for performance without giving up atomicity, consistency, or integrity.

Currently what happens in Linux looks kind of weird when you write it down:

1. A bunch of application threads issue a series of well-defined(-ish) IO syscalls which modify the logical contents of a filesystem. The kernel arbitrates an order when several operations affect the same files or metadata in concurrent threads according to some rules--maybe not the best rules, or rules that everyone likes, or rules that are even written down, but there are rules and the kernel does follow them. Observers that later inspect the data in the filesystem find something equivalent to the result of performing the mutation operations in the arbitrated order. Application developers can predict how multiple threads modifying a filesystem might behave as long as the system doesn't crash. There are test suites to verify consistent behavior across filesystems, and bugs can be reported against filesystems that don't comply. But then...

2. At irregular intervals, we mash all of the writes together in big shared buffers, and unreliably spew them out to disk in arbitrary order (not necessarily the optimal IO order, although IO optimization is ostensibly why we do this) without protecting those buffers against concurrent modification. If there's no crash and no threads concurrently writing, the disk holds a complete and correct copy of the filesystem state at some chosen instant in time. If there's a crash, we just leave the user applications with a mess to clean up, proportional in size to the amount of idle RAM we had lying around, possibly containing garbage and previously deleted data. The existing behavior is the spec, so it's impossible to report any undesirable behavior as a bug because all current behavior is tautologically correct, even when it changes.

So for an application developer, we set up a bunch of expectations in #1, and then fail to deliver on all of them in #2. No wonder they hate Linux filesystems and can't use fsync() correctly!

It would be nice to get to a point where we can say that not behaving like #1 after crashes is a filesystem bug, or the result of some risky non-default mount option or open flag chosen by an administrator or application developer. Features that reduce data integrity and impose new requirements on existing applications (like delalloc, or, for that matter, Unix-style buffered writes in general) should really be opt-in. It's maybe a bit late to have this opinion now, some decades after Unix-style buffered writes became a thing, but if multiple filesystems support atomic updates, there might be an opportunity to make better choices for default filesystem behavior in the future (e.g. do periodic filesystem commits as atomic updates as well).

Databases know how to make precise tradeoffs between integrity and performance, and they can use the many shades of ranged and asynchronous fsync() effectively to implement atomic updates and all their other requirements on any filesystem, and they are more willing than most to keep up with changing filesystem API. Non-database application developers don't want have to constantly learn new ways to avoid losing data every time filesystem default behavior changes. They just want the filesystem to give back some consistent version of the data their threads asked it to write, and don't need or want to know any details beyond knobs that adjust performance and latency (i.e. "write this to disk immediately, I'll wait" at one extreme, "you can write this any convenient time in the next hour, I don't care" at the other).

A way to do atomic writes

Posted May 29, 2019 13:06 UTC (Wed) by walters (subscriber, #7396) [Link] (4 responses)

One way to look at it is - what scenarios is the current setup good for? I would argue it's nearly optimal for the interesting case of "local software builds", e.g. on a developer workstation/laptop.

Compilers tend to generate a lot of intermediate files which often get unlinked quickly. If writes weren't buffered this would be an enormous hit. It'd really force build systems to write to a tmpfs style mount instead of the persistent source code directory.

I don't think anyone would want compliers to invoke fsync() either. But it's not truly transient either - I *do* usually want the intermediate build files still there after I reboot. (Contrast with most production buildsystems that don't do incremental builds). So this falls more into your case of:

"you can write this any convenient time in the next hour, I don't care"

But then on the topic of consistency - a while ago when I was using ccache I hit a bug where some of the cached objects were corrupted (zero-filled) after I had a kernel crash, and that caused bizarre errors. This probably hits the interesting corner case of "tmpfile + rename over existing is atomic, but writing a new file isn't". It took me longer than it should have to figure out, kept doing `git clean -dfx` and double checking the integrity of my compiler, etc.

In https://github.com/ostreedev/ostree we write all new objects to a "staging" directory that includes the value of /proc/sys/random/boot_id, and then syncfs() everything, then renameat() into place (I've been meaning to investigate doing parallel/asynchronous fsync instead). We assume that the files are garbage if the boot id doesn't match (i.e. system was rebooted before commit).

A way to do atomic writes

Posted May 29, 2019 17:35 UTC (Wed) by mmastrac (guest, #132326) [Link] (2 responses)

This is great information on something I hadn't considered before. Do you have anywhere you could point me at for "best practices" to use in combination with that technique (ie: fsync the file, then syncfs, then renameat iff boot_id matches, etc)?

A way to do atomic writes

Posted May 29, 2019 20:43 UTC (Wed) by walters (subscriber, #7396) [Link] (1 responses)

I don't know of any other software using the boot ID technique; I just made it up one day. It seems to work well though.

You can see some parts of this in
https://github.com/ostreedev/ostree/blob/e0ddaa811b2f7a1a...
https://github.com/ostreedev/ostree/blob/e0ddaa811b2f7a1a...
and

But this is definitely the kind of thing that would be better with kernel assistance somewhat like the article is talking about. Some sort of open flag or fcntl that says "I always want the whole file written, or nothing" - an extension to `O_TMPFILE` that allows fsyncing at the same time as `linkat()`? Which is also closely related to my wishlist item for O_OBJECT: https://marc.info/?l=linux-fsdevel&m=139963046823575&...

A way to do atomic writes

Posted Apr 21, 2024 18:50 UTC (Sun) by BrucePerens (guest, #2510) [Link]

5 years later I don't think anything's been done about this. In addition to your concerns, linkat() could use a flag to atomically unlink an existing target file when creating the new link. That seems to be the missing piece to creating an anonymous temporary file (O_TMPFILE) and then atomically giving it a name.

A way to do atomic writes

Posted May 29, 2019 18:45 UTC (Wed) by epa (subscriber, #39769) [Link]

Perhaps dumping the object files into the same directory as the source files (and hence the same filesystem) is a bit mad anyway? A tmpfs / ramdisk approach does seem more sensible, given that there is no particular need to spool out the intermediate files to physical disk. It’s one of those “we have always done it this way” things.

A way to do atomic writes

Posted May 29, 2019 18:56 UTC (Wed) by luto (guest, #39314) [Link] (2 responses)

This could be extremely expensive. For example, suppose I do a sequence of writes, serialized in this order:

1. Create huge file B1.
2. Write to file A.
3. Create huge file B2.
4. Delete B1.
5. Create huge file B3.
6. Delete B2.

And keep doing the create / delete dance.

In your atomic model, A cannot ever be made durable without writing at least one large unnecessary file to disk.

A way to do atomic writes

Posted May 30, 2019 8:39 UTC (Thu) by mjthayer (guest, #39183) [Link]

> This could be extremely expensive. For example, suppose I do a sequence of writes, serialized in this order:
[...]
> And keep doing the create / delete dance.

> In your atomic model, A cannot ever be made durable without writing at least one large unnecessary file to disk.

And is this a use case to be optimised for, or should people learn other ways of creating huge transient files? I know that sounds like a suggestive question, but it is not. I don't feel qualified to say.

A way to do atomic writes

Posted May 30, 2019 15:52 UTC (Thu) by zblaxell (subscriber, #26385) [Link]

The same thing happens now if the dirty writeback timer (or filesystem commit interval for filesystems that have one of those) expires in the middle of this sequence. We'd write out whichever parts of B1/B2 were in cache at the time (at the end of this sequence B3 exists, so we're always going to write that one). We also do a big flush if we run out of memory while buffering B1 or B2, and in that case we block writing processes (so we do the expensive thing _and_ force userspace to wait for it).

The difference with filesystem-atomic-by-default is that we'd choose some epoch and the atomic update would include all writes completed before the epoch and none of them after (any concurrent modification during writeback would be redirected to the next atomic update). So you'd get e.g. all of A, all of B1, and the first parts of B2 in one atomic update, and a later update would delete B1, write the rest of B2, and the first parts of B3. If there's a crash before B1 gets to disk, then A disappears.

This is fine! This is the correct behavior according to what the various system calls are documented to do, assuming that writeback caching behaves like an asynchronous FIFO pipeline to disk by default (i.e. when you didn't explicitly turn off atomic update, call fsync(), or provide some other hint that says the filesystem needs to do more or less work for specific files). It's not the most performant behavior possible, so it sucks, but the most performant behavior possible does bad things to data when the system crashes, so it sucks too. Most people who aren't saturating their storage stack with iops care more about correctness than performance, and would trade some iops to get correctness even if it means flushing out a multi-gigabyte temporary file now and then.

A way to do atomic writes

Posted May 29, 2019 19:12 UTC (Wed) by iabervon (subscriber, #722) [Link] (4 responses)

I think the right model is: Imagine there's a process on the machine constantly making backups to storage that doesn't go down in the crash. It will back up every file eventually, and will back up every file that fsync() is called on during the call. After the system comes back, the state matches the backup, except that there may be some arbitrary damage, which the kernel tries to minimize.

The current behavior, or any behavior, obviously fits this model, since it means that any post-restore state is possible, but it gives a different idea of what the kernel should be optimizing for with respect to crash resilience, which seems to me to be a good fit for how users rate system behavior.

Note that Unix-style buffered writes don't cause any problem here, because ordering changes there can be explained as getting unlucky as to the order the backup process read them.

A way to do atomic writes

Posted Jun 1, 2019 0:02 UTC (Sat) by zblaxell (subscriber, #26385) [Link] (3 responses)

> Imagine there's a process on the machine constantly making backups to storage that doesn't go down in the crash.

My backups are atomic, and have been for years...

OK, I'm imagining this now: of course, I expect the storage to be updated atomically and correctly every time by this backup process. Worst case, I don't get the last update, or the last N consecutive updates if I choose to optimize for performance by pipelining.

> After the system comes back, the state matches the backup, except that there may be some arbitrary damage

[still imagining] Nope, that's totally unacceptable. If I see any damage at all, someone's getting a bug report. Broken files are failure. [imagination off]

Proprietary storage vendors have supported atomic backups from snapshots of live filesystems for decades, and Linux hasn't been far behind. They just work. This is not a new or experimental thing any more. The better implementations have reasonable performance costs. Let's have filesystems on Linux that can get it right all the time, not just between crashes.

> The current behavior, or any behavior, obviously fits this model

"It's impossible to report any undesirable behavior as a bug because all current behavior is tautologically correct, even when it changes." For example, recently I pointed out on LWN that the current behavior is undesirable, and someone soon replied to explain the current behavior back to me without supporting argument, as if it was somehow self-evident that the current behavior is the best behavior possible.

This pattern happens a lot. It usually takes several tries to get past that and get into a discussion of what's undesirable about the current behavior, or how things could become better, or even how different people just have different preferences and expectations. Then we have to get past the horrified "but...that could be slightly slower!" response. Then we have to avoid regressing to the beginning of the loop when someone who missed the first part of the conversation jumps in. Usually by this point, everyone's gotten bored and left.

The problem is not that Linux is an incorrect implementation of the current model. The problem is that the current model is patently insane, and we should maybe consider sane models that could be used instead.

> a different idea of what the kernel should be optimizing for with respect to crash resilience

This doesn't appear to be a different idea. It seems to be just a retroactive justification of the way Linux filesystems have worked since the mid 90's--a time when every crash resulted in filesystem damage requiring time-expensive recovery tools, because nothing better had been implemented yet. Now we have atomic snapshots and journals and CoW and persistent writeback caches and future improvements like atomic write API implemented in multiple filesystems. We can do better than mid-90's standards of crash behavior now.

> Note that Unix-style buffered writes don't cause any problem here, because ordering changes there can be explained as getting unlucky as to the order the backup process read them.

The filesystem and the backup process could arrange not to change the ordering, and then it would all work properly. No luck required, nothing to explain.

A way to do atomic writes

Posted Jun 1, 2019 1:47 UTC (Sat) by iabervon (subscriber, #722) [Link] (2 responses)

The operating system can't guarantee that your hardware won't destroyed entirely by your building collapsing on it, even if you called fsync() on a file. If your drive controller goes bad, it could start writing the wrong blocks. If you want to be really sure about your data, you restore of real off-site (or, at least, off-box) backups, rather than using a filesystem left on a machine that crashed, if only because you don't know what could have gone wrong before whatever caused the crash actually brought down the system.

I'm saying that the post-crash state should exactly match some state that userspace might have observed had the system never crashed, and any deviation from that should be accounted and planned for like equipment failure, except that it may be attributed to the filesystem software rather than the disk hardware; in any case, you're trading off reliability against size, performance, and cost, and none of these is ever perfectly ideal.

A way to do atomic writes

Posted Jun 3, 2019 18:25 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (1 responses)

> If your drive controller goes bad, it could start writing the wrong blocks.

That happens from time to time. Storage stacks can deal with that kind of event gracefully. Today we expect those errors to be detected and reported by the filesystem or some layer below it, and small errors repaired automatically when there is sufficient redundancy in the system.

> If you want to be really sure about your data, you restore of real off-site (or, at least, off-box) backups

To make correct backups, the backup process needs a complete, correct, and consistent image of the filesystem to back up, so step one is getting the filesystem to be capable of making one of those.

Once you have that, and can atomically update it efficiently while the filesystem is online, you can stop using fsync as a workaround for legacy filesystem behaviors that should really be considered bugs now. fsync should only be used for its two useful effects: to reorder and isolate updates to individual files for reduced latency, and to synchronize IO completion with events outside of the filesystem (and those two things should become separate system calls if they aren't already). If an application doesn't need to do those two things, it should never need to call fsync, and its data should never be corrupted by a filesystem.

> you don't know what could have gone wrong before whatever caused the crash actually brought down the system.

If we allow arbitrary failure modes to be in scope, we'll always lose data. To manage risks, both the frequency and cost of event occurrence have to be considered.

Most of the time, crashes don't have complications with data integrity impact (e.g. power failure, HA forcing a reboot, kernel bugs with known causes and effects). We expect the filesystem to deal with those automatically, so we can redirect human time to cleaning up after the rarer failures: RAM failures not detected by ECC, multi-disk RAID failures, disgruntled employees with hammers, etc.

When things start going unrecoverably wrong, each subsystem that detects something wrong gives us lots of information about the failure, so we can skip directly to the replace-hardware-then-restore-from-backups step even before the broken host gets around to crashing. All the filesystem has to do in those cases is provide a correct image of user data during the previous backup cycle.

None of the above helps if the Linux filesystem software itself is where most of the unreported corruption comes from. It was barely tolerable while Linux filesystems were a data loss risk comparable to the rest of the storage stack, but over the years the rest of the stack has become more reliable while Linux filesystems have stayed the same or even gotten a little worse.

> I'm saying that the post-crash state should exactly match some state that userspace might have observed had the system never crashed, and any deviation from that should be accounted and planned for like equipment failure

I'm saying that in the event there is no equipment failure, there should be no deviation. Even if there is equipment failure, there should not necessarily be a deviation, as long as the system is equipped to handle the failure. We don't wake up a human if just one disk in a RAID array fails--that can wait until morning. We don't want a human to spend time dealing with corrupted application caches after a battery failure--the filesystem shouldn't corrupt the caches in the first place.

> in any case, you're trading off reliability against size, performance, and cost, and none of these is ever perfectly ideal.

...aaaand we're back to the horrified "but it could be slower!" chant again.

Atomic update is probably going to end up being faster than delalloc and fsync for several classes of workload once people work on optimizing it for a while, and start removing fsync workarounds from application code. fsync is a particularly bad way to manage data integrity when you don't have external synchronization constraints (i.e. when the application doesn't have to tell anyone else the stability status of its data) and when your application workload doesn't consist of cooperating threads (i.e. when the application code doesn't have access to enough information to make good global IO scheduling decisions the way that monolithic database server applications designed by domain experts do).

It's easier, faster, and safer to run a collection of applications under eatmydata on a filesystem with atomic updates than to let those applications saturate the disk IO bandwidth with unnecessary fsync calls on a filesystem that doesn't have atomic updates--provided, as I mentioned at the top, that there's a way to asynchronously pipeline the updates; otherwise, you just replace a thousand local-IO-stall problems with one big global-IO-stall problem (still a net gain, but the latency spikes can be nasty).

Decades ago, when metadata journaling was new, people complained it might be an unreasonable performance hit, but it turned out that the journal infrastructure could elide a lot of writes and be faster at some workloads than filesystems with no journal. The few people who have good reasons not to run filesystems with metadata journals can still run ext2 today, but the rest of the world moved on. Today nobody takes a filesystem seriously if it can't recover its metadata to a consistent state after uncomplicated crashes (though on many filesystems we still look the other way if the recovered state doesn't match any instantaneous pre-crash state, and we should eventually stop doing that). We should someday be able to expect user data to be consistent after crashes by default as well.

Worst case, correct write behavior becomes a filesystem option, then I can turn it on, and you can turn it off (or it becomes an inode xattr option, and I can turn it off for the six files out of a million where crash corruption is totally OK). You can continue to live in a world where data loss is still considered acceptable, and the rest of us can live in a world where we don't have to cope with the post-crash aftermath of delalloc or the pre-crash insanity of fsync.

A way to do atomic writes

Posted Jun 6, 2019 12:03 UTC (Thu) by Wol (subscriber, #4433) [Link]

> ...aaaand we're back to the horrified "but it could be slower!" chant again.

Which is a damn good reason NOT to use fsync ...

When ext4 came in, stuff suddenly started going badly wrong where ext3 had worked fine. The chant went up "well you should have used fsync!". And the chant came back "fsync on ext4 is slow as molasses!".

On production systems with multiple jobs, fsync is a sledgehammer to crack a nut. A setup that works fine on one computer WITHOUT fsync could easily require several computers WITH fsync.

Cheers,
Wol

A way to do atomic writes

Posted May 29, 2019 9:32 UTC (Wed) by nilsmeyer (guest, #122604) [Link]

I vaguely remember f2fs already having atomic writes support through some ioctl.

A way to do atomic writes

Posted May 30, 2019 5:44 UTC (Thu) by qtplatypus (subscriber, #132339) [Link] (1 responses)

I have been thinking about this and I'm wondering if on fs that support it ioctl_ficlone it might leveraged to make atomic writes of the A without D type that many application devs desire.

A sequence of

create B
ioctl_ficlone B from A
write to B
ioctl_ficlone A from B

Though that looks too simple and there may be something I'm missing.

A way to do atomic writes

Posted May 30, 2019 12:36 UTC (Thu) by Jonno (subscriber, #49613) [Link]

> Though that looks too simple and there may be something I'm missing.

For crash consistency, you have to at least add an `fdatasync B` before `ioctl_ficlone A from B`, or you might get garbage in A after recovery. It also depends upon the filesystem writing the new extent mapping of A to disk atomically, which I don't think is actually guaranteed (though most filesystems probably do so anyway).

A way to do atomic writes

Posted May 30, 2019 16:59 UTC (Thu) by perennialmind (guest, #45817) [Link] (2 responses)

This is way too reminiscent of Transactional NTFS and Microsoft's earlier attempts to bring ACID guarantees to the filesystem. The generalized contract proposed makes a huge commitment and relies an awful lot on the particulars of journaling filesystems. The per-filesystem feature doesn't strike me as "kind of ugly" at all: it sounds conservative and maintainable, with reasonably predictable ramifications.

For a more general solution, I'm much more interested in that Barrier-Enabled IO Stack for Flash Storage. It just makes sense to me. It's so neat and plausible, that I can't help but wonder if it isn't also wrong. Else, I should be reading more about it on LWN. <shrug>

All I want is a sane ordering guarantee. I'm willing to accept intermediate changes being observable if it means I can have some weak, localized constraints on causality.

write(...)
write(...)
fdatabarrier(...)
rename(...)

A way to do atomic writes

Posted Jun 2, 2019 16:55 UTC (Sun) by daniel (guest, #3181) [Link] (1 responses)

So in other words, you want exactly Tux3 semantics.

A way to do atomic writes

Posted Jun 6, 2019 14:29 UTC (Thu) by Wol (subscriber, #4433) [Link]

A synchronous flush?

Dunno how easy it would be to implement this, but imagine ...

My application (database, whatever) writes a load of stuff - a user-space journal. It then calls the flush. This triggers writing all the buffers to disk, with a guarantee that writes AFTER my sync call can be moved EARLIER in time, but ALL EARLIER writes will complete before my call returns.

That way, my application knows, when the call returns, that it's safe to start updating the files because it can recover a crash from the logs. It doesn't interfere with other applications because it's not hogging i/o. And if it's one of the few applications on an almost-single-use system then the almost continuous flushing it might trigger probably won't actually get noticed much - especially if it's a multi-threaded database because it can happily mix flushing one transaction's logs with another transaction's data.

Cheers,
Wol

A way to do atomic writes

Posted May 30, 2019 20:50 UTC (Thu) by yige (guest, #132365) [Link] (4 responses)

F.Y.I. we have an academic project, TxFS, which provides a version of Ext4 file system with ACID transactions. The idea can be generalized to other file systems that use journaling or COW to maintain crash consistency as well.
https://github.com/ut-osa/txfs

We added three new system calls to initiate/commit/abort a per-process file system transaction. It also separates ordering from durability, and does optimizations like eliminating temporary durable files, as discussed in the previous comments. e.g. The followed piece of pseudo-code will be executed without the persistence of fileA.

fs_tx_begin();
create(fileA);
create(fileB);
write(fileA);
write(fileB);
unlink(fileA);
fs_tx_commit();

It currently only supports a subset of file-related system calls since it's an experimental project. (e.g. rename and mmap are not supported yet.)

A way to do atomic writes

Posted Jun 3, 2019 17:36 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (2 responses)

> The followed piece of pseudo-code will be executed without the persistence of fileA.

...if fileA and fileB can be stored in the cache memory of the implementation, yes; otherwise, you have to start writing the data of at least one of fileA or fileB to disks before you get to the unlink which removes the need to write.

That point was missed in the previous comments too: temporary file optimizers can't read your mind, they can only read an incomplete transaction log, so they only work if the temporary file gets deleted while it is (or parts of it are) still in memory. If you really want temporary file optimization, you need to label the file as a temporary as early as possible, so it can avoid getting flushed to disk too early.

It might be nice to add an open flag named O_TMPFILE that tells the system that a file shouldn't be persisted...except we already have one, and it already does that. So these tmpfile examples could look more like:

fs_tx_begin();
create(dirA, O_TMPFILE | normal_flags); // returns fileA file descriptor
create(fileB, normal_flags);
write(fileA);
write(fileB); // if we start running out of memory here, we might flush fileB, not fileA
// don't need to unlink fileA, it will disappear when closed.
fs_tx_commit();

If you need fileA to have a name or you need to close it during the transaction, it gets more complicated to use O_TMPFILE: you have to do a weird dance with /proc/self/fd/* symlinks and linkat, and you do need to do the unlink at the end, and the O_TMPFILE flag is not a correct hint to the optimizer if you change your mind and keep the file through the end of the transaction. It's also not clear that keeping fileA in RAM was a net win above--maybe it's better if fileA gets pushed out to disk so there's more RAM for caching fileB.

So maybe a separate indicator (an xattr or fadvise flag) that says "optimize by minimizing writes to this file", and nothing else, might be more useful to provide hints to transaction infrastructure. That will handle the cases where it's better to flush the temporary file to disk under memory pressure instead of the permanent one, and it can be set or cleared on files without having to modify the parts of the application that create files (which might be inaccessible or hard to modify).

The nice thing about xattrs is that system integrators can set them without having to hack up applications, so you can get broad behaviors like "everything created in this directory from now on is considered a temporary file for the purposes of transaction commit optimization."

A way to do atomic writes

Posted Jun 4, 2019 21:33 UTC (Tue) by yige (guest, #132365) [Link]

Yes, we do cache files in memory during a transaction. Flushes to disk are ignored until a transaction gets committed. This guarantees isolation for ACID transactions. But you're right, it sets up a limit for the size of a transaction. In our case, the optimization for eliminating temporary durable files is more of a positive side effect of ACID transactions. It helps especially in case users are unaware of the existence of such files in their transaction code.

I agree that a separate indicator for minimized persistency on temporary files can be a good idea, so that it gets flushed only in face of memory pressure.

A way to do atomic writes

Posted Jun 6, 2019 14:34 UTC (Thu) by Wol (subscriber, #4433) [Link]

> It might be nice to add an open flag named O_TMPFILE that tells the system that a file shouldn't be persisted...except we already have one, and it already does that. So these tmpfile examples could look more like:

I worked on a FORTRAN system in the 80s that had such a file - it flagged the file as "delete on close". Except it had a bug - it flagged the file *descriptor* as delete on close. And because you could re-use a file descriptor my program started deleting a bunch of random - important - files. Caused the operators (it was a mainframe) a small amount of grief until I twigged the problem, tested it, and raised a bug report!

Cheers,
Wol

A way to do atomic writes

Posted Apr 26, 2020 21:44 UTC (Sun) by edgecase (guest, #138459) [Link]

Looking at the set of all filesystems in-tree, out of tree, experimental, dead and gone, I wonder if there are just too many conflicting requirements, use-cases, and types of underlying hardware, to choose one set of semantics.

The POSIX API seems to be bursting at the seams also.

I wonder if ACID could be broken up, at a lower layer, and POSIX semantics built on top, as well as other APIs (or mount options?), that could focus on whatever combination of features is wanted.

One in particular I can see being useful, is similar to the Android example mentioned earlier in this thread. The use-case of an operating system package manager installing or updating a set of packages (apt-get upgrade, yum update) for example, would ideally employ the sequence:

1) unpack many files Isolated, but not durable. This takes advantage of elevator seeks and write combining, but does not put (as much) pressure on journal.
2) wait until they are all Durable (as a set, not individually)
3) rename them all into place in one transaction, directory lock could be taken once per directory, not per file
4) let them be visible (end the isolation)

Making a special filesystem for doing this isn't ideal; there should be a more general way, since in this use-case, modifying files atomically isn't of value, but for someone else it might be.

A way to do atomic writes

Posted May 20, 2021 20:33 UTC (Thu) by aist (guest, #51495) [Link]

Atomic operations on files don't have much sense now, because they imply two things:

1. The model of concurrency.
2. The system of relaxations of atomicity, to support different consistency/performance tradeoffs.

Atomicity is not cheap, and, what is much more important, it's not (easily) composable. Because of that, pushing high-level semantics down to hardware (disks) will not work as expected. Elementary (1-4 blocks) atomic operations are more than enough to support high-level composable atomic semantics across many disks. BUT, it's very hard to have high-level (generic) atomic writes which will be parallel. People, relational databases apparently provide pretty concurrent atomicity, but they rely on the fact that relational tables are unordered (sets or records), so it's relatively easy to merge multiple parallel versions of the same table into the single one. Merging of ordered data structures like vectors (and files) is not defined in general case (it's application-defined).

There are pretty good single-writer-multiple-readers (SWMR) schemes which are pretty light-weight, log-structured, wait-free and atomic for readers and writers (see LMDB for an example), but they are inherently single-writer. So, only one writer at a time can own the domain of atomicity (single file, directory, file-system). Readers are fully concurrent though with themselves and with writers. SWMR is best suitable for dynamic data analytics applications because of point-in-time semantics for readers (stable cursors, etc).

Multiple concurrent atomic writes (MWMR) are possible, but they are not that wait-free like SWMR, have much higher operational overhead and require atomic counters for (deterministic) garbage collection. And write-write conflict-resolution is application-defined. So, if we want an MWMR engine to be implemented at the device level, it will require pretty tight integration with applications, implying pretty complex APIs. Simply speaking, it isn't worth the efforts.

Log-structured SWMR/MWMR may work well with a single-block-scale atomic operations, they just need certain power-failure guaranties. They can be implemented entirely in the userspace as services on top of asynchronous IO interfaces like uring. Partial POSIX file API is emulation for legacy applications accessing the "atomic data" is also possible via FUSE.

Adding complex high-level atomic semantics (especially multi-operation commits) to the POSIX API will create much more problems than atomics are intended to solve.