
A way to do atomic writes

Posted Jun 3, 2019 18:25 UTC (Mon) by zblaxell (subscriber, #26385)
In reply to: A way to do atomic writes by iabervon
Parent article: A way to do atomic writes

> If your drive controller goes bad, it could start writing the wrong blocks.

That happens from time to time. Storage stacks can deal with that kind of event gracefully. Today we expect those errors to be detected and reported by the filesystem or some layer below it, and small errors repaired automatically when there is sufficient redundancy in the system.

> If you want to be really sure about your data, you restore from real off-site (or, at least, off-box) backups

To make correct backups, the backup process needs a complete, correct, and consistent image of the filesystem to back up, so step one is getting the filesystem to be capable of making one of those.

Once you have that, and can atomically update it efficiently while the filesystem is online, you can stop using fsync as a workaround for legacy filesystem behaviors that should really be considered bugs now. fsync should only be used for its two useful effects: to reorder and isolate updates to individual files for reduced latency, and to synchronize IO completion with events outside of the filesystem (and those two things should become separate system calls if they aren't already). If an application doesn't need to do those two things, it should never need to call fsync, and its data should never be corrupted by a filesystem.

> you don't know what could have gone wrong before whatever caused the crash actually brought down the system.

If we allow arbitrary failure modes to be in scope, we'll always lose data. To manage risks, both the frequency and cost of event occurrence have to be considered.

Most of the time, crashes don't have complications with data integrity impact (e.g. power failure, HA forcing a reboot, kernel bugs with known causes and effects). We expect the filesystem to deal with those automatically, so we can redirect human time to cleaning up after the rarer failures: RAM failures not detected by ECC, multi-disk RAID failures, disgruntled employees with hammers, etc.

When things start going unrecoverably wrong, each subsystem that detects something wrong gives us lots of information about the failure, so we can skip directly to the replace-hardware-then-restore-from-backups step even before the broken host gets around to crashing. All the filesystem has to do in those cases is provide a correct image of user data during the previous backup cycle.

None of the above helps if the Linux filesystem software itself is where most of the unreported corruption comes from. It was barely tolerable while Linux filesystems were a data loss risk comparable to the rest of the storage stack, but over the years the rest of the stack has become more reliable while Linux filesystems have stayed the same or even gotten a little worse.

> I'm saying that the post-crash state should exactly match some state that userspace might have observed had the system never crashed, and any deviation from that should be accounted and planned for like equipment failure

I'm saying that in the event there is no equipment failure, there should be no deviation. Even if there is equipment failure, there should not necessarily be a deviation, as long as the system is equipped to handle the failure. We don't wake up a human if just one disk in a RAID array fails--that can wait until morning. We don't want a human to spend time dealing with corrupted application caches after a battery failure--the filesystem shouldn't corrupt the caches in the first place.

> in any case, you're trading off reliability against size, performance, and cost, and none of these is ever perfectly ideal.

...aaaand we're back to the horrified "but it could be slower!" chant again.

Atomic update is probably going to end up being faster than delalloc and fsync for several classes of workload once people work on optimizing it for a while, and start removing fsync workarounds from application code. fsync is a particularly bad way to manage data integrity when you don't have external synchronization constraints (i.e. when the application doesn't have to tell anyone else the stability status of its data) and when your application workload doesn't consist of cooperating threads (i.e. when the application code doesn't have access to enough information to make good global IO scheduling decisions the way that monolithic database server applications designed by domain experts do).

It's easier, faster, and safer to run a collection of applications under eatmydata on a filesystem with atomic updates than to let those applications saturate the disk IO bandwidth with unnecessary fsync calls on a filesystem that doesn't have atomic updates--provided, as I mentioned at the top, that there's a way to asynchronously pipeline the updates; otherwise, you just replace a thousand local-IO-stall problems with one big global-IO-stall problem (still a net gain, but the latency spikes can be nasty).

Decades ago, when metadata journaling was new, people complained it might be an unreasonable performance hit, but it turned out that the journal infrastructure could elide a lot of writes and be faster at some workloads than filesystems with no journal. The few people who have good reasons not to run filesystems with metadata journals can still run ext2 today, but the rest of the world moved on. Today nobody takes a filesystem seriously if it can't recover its metadata to a consistent state after uncomplicated crashes (though on many filesystems we still look the other way if the recovered state doesn't match any instantaneous pre-crash state, and we should eventually stop doing that). We should someday be able to expect user data to be consistent after crashes by default as well.

Worst case, correct write behavior becomes a filesystem option, then I can turn it on, and you can turn it off (or it becomes an inode xattr option, and I can turn it off for the six files out of a million where crash corruption is totally OK). You can continue to live in a world where data loss is still considered acceptable, and the rest of us can live in a world where we don't have to cope with the post-crash aftermath of delalloc or the pre-crash insanity of fsync.


A way to do atomic writes

Posted Jun 6, 2019 12:03 UTC (Thu) by Wol (subscriber, #4433)

> ...aaaand we're back to the horrified "but it could be slower!" chant again.

Which is a damn good reason NOT to use fsync ...

When ext4 came in, stuff suddenly started going badly wrong where ext3 had worked fine. The chant went up "well you should have used fsync!". And the chant came back "fsync on ext4 is slow as molasses!".

On production systems with multiple jobs, fsync is a sledgehammer to crack a nut. A setup that works fine on one computer WITHOUT fsync could easily require several computers WITH fsync.

Cheers,
Wol


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds