LWN: Comments on "Fixing error reporting—again"

Fixing error reporting—again

jlayton — Sun, 16 Jun 2019 01:56:11 +0000

sync() returns void. syncfs() returns an int, and so could (in principle) return an error if there is a problem with writeback. syncfs() is not defined by POSIX, so it's not "broken" per se, but I think it'd probably be more helpful to have it return an error if there was an issue with writeback.
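
To make the sync()/syncfs() difference concrete, here is a minimal C sketch of a caller that checks the syncfs() return value. sync_containing_fs() is a hypothetical helper name. Whether a writeback failure actually surfaces as an error here depends on the kernel, which is exactly the point under discussion.

```c
#define _GNU_SOURCE        /* syncfs() is a Linux extension, not POSIX */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush the filesystem that contains `path`.
 * Returns 0 on success or a negative errno value. */
static int sync_containing_fs(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    int ret = 0;
    if (syncfs(fd) < 0)
        ret = -errno;      /* sync() has no way to report this */

    close(fd);             /* read-only fd: nothing of ours to lose here */
    return ret;
}
```

A caller using sync() instead has nothing to check, since its return type is void.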

Fixing error reporting—again

quocanh1897 — Thu, 11 Apr 2019 11:39:29 +0000

> syncfs() is "really broken" in its error reporting. He plans to fix that, probably by using another errseq_t in the superblock, since reporting from syncfs() requires a separate cursor on the error state.
I thought sync() always returns success; how is it "really broken"? And what is a "separate cursor on the error state"?
Thanks.
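
One way to picture the "separate cursor": the error state records the latest writeback error plus a counter, and each observer keeps its own remembered counter value so that it sees each new error exactly once. The C sketch below is a toy model under that assumption, not the kernel's errseq_t (which packs its state into a single 32-bit word); all toy_* names are hypothetical. The kernel's own interface lives in include/linux/errseq.h (errseq_set(), errseq_sample(), errseq_check_and_advance()).

```c
#include <errno.h>
#include <stdint.h>

/* Toy model only; not the kernel's errseq_t layout. */
struct toy_errseq {
    int      err;   /* most recent error as a negative errno, 0 if none */
    uint32_t seq;   /* incremented every time an error is recorded */
};

/* A cursor remembers how far one observer has already looked. */
typedef uint32_t toy_cursor;

/* Record a new writeback error. */
static void toy_set(struct toy_errseq *e, int err)
{
    e->err = err;
    e->seq++;
}

/* Take a starting cursor, e.g. at open() time. */
static toy_cursor toy_sample(const struct toy_errseq *e)
{
    return e->seq;
}

/* Return the error if one was recorded since this cursor last advanced,
 * then advance the cursor so the same error is not reported twice. */
static int toy_check_and_advance(const struct toy_errseq *e, toy_cursor *c)
{
    if (*c == e->seq)
        return 0;
    *c = e->seq;
    return e->err;
}
```

With one cursor per open file and another kept for syncfs(), an fsync() on a file can consume the error through its own cursor without hiding it from a later syncfs() caller.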

Fixing error reporting—again

Trol1024 — Tue, 04 Sep 2018 05:15:58 +0000

The crucial thing may be that a read() after a successful open()-write()-close() may return old data.

That may happen where an async writeback error occurs after close() and the inode/mapping get evicted before read().

That violates POSIX, which requires that a read() that can be proved to occur after a write() has returned will return the new data.

Fixing error reporting—again

bfields — Fri, 27 Apr 2018 21:29:49 +0000

"I wonder how many people don't fsync on NFS because they know close() is enough and are about to find out that it isn't."

Do you think that's really likely?

Linux knfsd doesn't support write delegations, but I believe that both the client and some popular servers have supported them for a while, and I don't recall seeing such a bug report.

So, I'm optimistic, but I suppose it's something to keep an eye on. (Possibly also worth checking that the man pages don't provide any false guarantees here.)

Fixing error reporting—again

mjg59 — Fri, 27 Apr 2018 06:24:42 +0000

> Remember when ext3 had that wonderful "rename causes fsync" semantic, so nobody bothered to fsync

No? The behaviour people were expecting was that doing a write and then a rename would result in those operations happening in order and that you'd either end up with the old file or the new file. People weren't fsyncing because they didn't care *which* file ended up on disk, not because they were expecting rename to cause an implicit fsync.
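
For reference, the write-then-rename pattern under discussion looks roughly like the C sketch below, here with an explicit fsync() before the rename so that the "old file or new file" outcome does not depend on filesystem-specific ordering. replace_file() and its parameters are hypothetical names, and the sketch omits the fsync() of the containing directory that would be needed to make the rename itself durable.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>      /* rename() */
#include <unistd.h>

/* Hypothetical helper: replace `path` with new contents via a temp file.
 * Returns 0 on success or a negative errno value. */
static int replace_file(const char *path, const char *tmp_path,
                        const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    int err = 0;
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the write */
            err = -errno;
            break;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Flush the new contents before they become visible under `path`. */
    if (err == 0 && fsync(fd) < 0)
        err = -errno;
    if (close(fd) < 0 && err == 0)
        err = -errno;

    /* rename() atomically switches the name from old file to new file. */
    if (err == 0 && rename(tmp_path, path) < 0)
        err = -errno;
    if (err != 0)
        unlink(tmp_path);           /* do not leave a partial temp file */
    return err;
}
```

Dropping the fsync() call gives the variant the comment describes: the application accepts either file surviving a crash, but relies on the filesystem not exposing the rename before the data.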

Fixing error reporting—again

donald.buczek — Fri, 27 Apr 2018 05:34:26 +0000

There is a difference: The Caught Fire error class would be reported by a read error.

Fixing error reporting—again

neilbrown — Fri, 27 Apr 2018 02:37:42 +0000

> Unless you have a write delegation, I believe....

uh-oh.
Remember when ext3 had that wonderful "rename causes fsync" semantic, so nobody bothered to fsync and when ext4 had more sane semantics people complained?
I wonder how many people don't fsync on NFS because they know close() is enough and are about to find out that it isn't.

Fixing error reporting—again

bfields — Fri, 27 Apr 2018 00:56:26 +0000

"NFS (and possibly other similar filesystems) is a bit different as close() always does an internal fsync() first - so a lack of an error there means that all the data is safe."

Unless you have a write delegation, I believe....

Fixing error reporting—again

MarcB — Thu, 26 Apr 2018 13:54:13 +0000

> The thing about close() is that a lack of an error doesn't tell you anything about the data. It just tells you that writeback hasn't hit an error *yet*. I don't see how you can depend on something that is already unreliable.

That is exactly the question I ask as an application developer: What does an error on close() mean, and why should I check it?
As I see it, *not* doing fsync(), but then checking close(), only catches errors in an unreliable way.

As an example:

Let's assume I do a doomed write(), that will hit a bad block, followed directly by a close().

Now, after the write(), I get preempted for some time, and when my process runs again and can submit the close(), it will get the error that occurred while other processes were running. Fine.

But now, I am on an idle system and will be scheduled immediately once my write() returns. The error has not occurred yet, and I won't see it. Not so fine.

(Alternatively: The first write() happens shortly before the automatic ext4 filesystem sync, the second shortly after).

So, if I get an error from close(), something is wrong. But if I don't get the error, exactly the same thing might be wrong, it's just that no one has noticed yet.

I find it hard to come up with a scenario where that would be truly useful, but perhaps I am missing something. (Quotas and NFS are obvious candidates; they might add failure classes that close() catches reliably).
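
The distinction drawn above can be sketched in C by keeping the two results apart: fsync() waits for writeback and so is the call whose success says something about the data, while close() only reports whatever happens to be known at that moment. flush_and_close() and struct flush_result are hypothetical names used for illustration.

```c
#include <errno.h>
#include <unistd.h>

struct flush_result {
    int fsync_err;  /* 0 or negative errno from fsync() */
    int close_err;  /* 0 or negative errno from close() */
};

/* Hypothetical helper: flush and close `fd`, reporting each step separately. */
static struct flush_result flush_and_close(int fd)
{
    struct flush_result r = { 0, 0 };

    /* Waits for writeback, so a 0 here means the data reached storage. */
    if (fsync(fd) < 0)
        r.fsync_err = -errno;

    /* Without the fsync() above, a 0 here would only mean that no error
     * had been noticed yet. */
    if (close(fd) < 0)
        r.close_err = -errno;

    return r;
}
```

An application that cares about the data would act on fsync_err; close_err is still worth logging, but its absence proves nothing on its own.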

Fixing error reporting—again

epa — Thu, 26 Apr 2018 08:38:26 +0000

Well yes, and even if writeback has succeeded that doesn't promise you that the hard disk won't spontaneously catch fire and destroy your data tomorrow. The idea is to report, reliably, any errors that have occurred so far.

Fixing error reporting—again

neilbrown — Wed, 25 Apr 2018 20:51:16 +0000

> But it is documented that the close() call can return errors, so some users will be dependent on that behavior, Chinner said.

The thing about close() is that a lack of an error doesn't tell you anything about the data. It just tells you that writeback hasn't hit an error *yet*. I don't see how you can depend on something that is already unreliable.

NFS (and possibly other similar filesystems) is a bit different as close() always does an internal fsync() first - so a lack of an error there means that all the data is safe. For other filesystems, we don't need to go out of our way to report an error that cannot be relied upon anyway.