Improved block-layer error handling
Posted Jun 5, 2017 12:05 UTC (Mon) by nix (subscriber, #2304)
In reply to: Improved block-layer error handling by neilbrown
Parent article: Improved block-layer error handling
Of course, POSIX provides no way for applications to say 'hey, fs, I want integrity from this, thank you', so the fs does whatever checksumming it can so that applications don't all need to reimplement it. This might make sense: it seems like something that could probably be a per-filesystem attribute, or at least a whole-directory-tree attribute or something. Of course, POSIX also provides no way to say 'hey, fs, this file was written but failed integrity checks': -EIO is, ah, likely to be misinterpreted by essentially everything. So while it would be nice to have app-level integrity checking, I doubt we can get there from here: we do need to do it invisibly, below the visible surface of the system.
Posted Jun 5, 2017 18:51 UTC (Mon) by zblaxell (subscriber, #26385)
Nor does it need one. POSIX should assume integrity by default unless applications say the opposite. One way applications can do that is by not checking any system call return values.
> POSIX also provides no way to say 'hey, fs, this file was written but failed integrity checks'
I don't think any changes to POSIX are required. We already have most of this in existing filesystems, just not in most existing filesystems.
In cases like compiles, where the writing application has completely disappeared before the block writes even start, there's no process to notify about the failure at the time the failure is detected. fsync() return behavior is irrelevant to this case--*every* system call, even _exit, returns before *any* error information is available. We want compiles to be fast, so we don't want to change this. A different solution is required. Note that reporting errors through fsync() is not wrong--it's just not applicable to this case.
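For processes that do stick around, fsync() reporting is exactly how deferred write errors reach the application, provided every return value is actually checked. A minimal sketch in plain POSIX C of writing a buffer durably, with every call checked (the function name and path handling are illustrative, not from any particular tool):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write buf durably to path, checking every return value, including
 * fsync() and close().  Returns 0 on success, -1 with errno set on
 * failure.  Plain POSIX; nothing here is filesystem-specific. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, not failed: retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Deferred block-write errors (EIO, ENOSPC) are reported here --
     * but only if the process is still alive to see them. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

This is the path that a compiler deliberately skips for speed, which is why the poisoned-metadata scheme below is needed as well.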
For compiles we want the block-level error information passed from one producing process to another consuming process when the processes communicate through a filesystem. So let's do exactly that: if a block write fails, the filesystem should update its metadata to say "these data blocks were not written successfully and now contain garbage." Future reads of the affected logical offsets of the affected inodes should return EIO until the data is replaced by new (successful) writes, or the affected blocks are removed from the file by truncate, or the file is entirely deleted. If the filesystem metadata update fails too, move up the hierarchy (block > inode > subvol > planet > constellation > whatever) until the entire FS is set readonly and/or marked with errors for the next fsck to clean up by brute force.
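The core of the scheme can be sketched as a toy in-memory model; every name here is hypothetical, and a real implementation would live in filesystem metadata, not a C struct:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define NBLOCKS 8

/* Toy model of an inode with a per-block "write failed" bit.  In a
 * real filesystem this bit would be persistent metadata. */
struct toy_inode {
    unsigned char data[NBLOCKS];
    unsigned char poisoned[NBLOCKS];  /* 1 = write failed, contents unknown */
};

/* Called on write-completion failure: poison the block. */
void toy_mark_write_failure(struct toy_inode *ino, int blk)
{
    ino->poisoned[blk] = 1;
}

/* A later successful write replaces the data and clears the poison. */
void toy_write_block(struct toy_inode *ino, int blk, unsigned char byte)
{
    ino->data[blk] = byte;
    ino->poisoned[blk] = 0;
}

/* Reads of poisoned blocks return -EIO instead of garbage. */
int toy_read_block(const struct toy_inode *ino, int blk, unsigned char *out)
{
    if (ino->poisoned[blk])
        return -EIO;
    *out = ino->data[blk];
    return 0;
}
```

The point is that the consumer gets a hard EIO at read time, no matter how long ago the producer exited.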
Note that this scheme is different from block checksums. The behavior is similar, but block checksums are used to detect read errors (successful write followed by provably incorrect read), not write errors (where the write itself fails and the disk contents are unknown, possibly correct, possibly incorrect with hash collision). Checksums would not be an appropriate way to implement this. The existing prealloc mechanism in many filesystems could be extended to return EIO instead of zero data on reads. Prealloc already has most of the desired behavior wrt block allocation and overwrites.
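Today's prealloc behavior is easy to observe: an unwritten preallocated extent reads back as zeros. The proposal above would reuse that machinery so poisoned ranges read back as EIO instead. A small demonstration of the current behavior (the temp-file handling is illustrative, and assumes the filesystem supports posix_fallocate):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Preallocate a range and read it back: filesystems with unwritten
 * extents return zeros for never-written data today.  Returns 0 if
 * the range read back as zeros, 1 if not, -1 on setup failure. */
int prealloc_reads_zeros(void)
{
    char path[] = "/tmp/preallocXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* anonymous temp file */

    if (posix_fallocate(fd, 0, 4096) != 0) {
        close(fd);
        return -1;                      /* fs may not support prealloc */
    }

    char buf[4096];
    memset(buf, 0xff, sizeof buf);      /* poison the buffer first */
    if (pread(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf) {
        close(fd);
        return -1;
    }
    close(fd);

    for (size_t i = 0; i < sizeof buf; i++)
        if (buf[i] != 0)
            return 1;                   /* nonzero data came back */
    return 0;                           /* zeros, as expected today */
}
```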
> EIO is, ah, likely to be misinterpreted by essentially everything
I'm not sure how EIO could be misinterpreted in this context. The application is asking for data, and the filesystem is saying "you can't have that data because an IO-related failure occurred," so what part of EIO is misinterpreted exactly? What application (other than a general disk-health monitoring application, which could get detailed block error semantics through a dedicated kernel notification mechanism instead) would care about lower-level details, and which details would it use?
Also note EIO already happens in most filesystems, so we're not talking theoretically here. Most applications (even linkers), if they notice errors at all (**), notice EIO and do something sensible when they see it (*). This produces much, much more predictable results than just throwing random disk garbage into applications and betting they'll notice.
(*) interesting side note: linkers don't read all of the data in their input files, and will happily ignore EIO if it only occurs in the areas of files they don't read. Maybe not the best example case for a "data integrity" discussion. ;)
(**) for many years, GCC's as and ld didn't even notice ENOSPC, and would silently produce garbage binaries when the disk was full (maybe these would be detected by the linker later on...maybe not). Arguably we should also mark inodes with a persistent error bit if there is an ENOSPC while writing to them, but that *is* a major change which will surprise ordinary POSIX applications.
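The as/ld failure mode is the classic stdio trap: fwrite() into stdio's buffer succeeds, and the ENOSPC only surfaces when the buffer is flushed to the kernel. A minimal sketch of the checks that were missing (the function name is made up, and this is generic stdio usage, not GCC's actual code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write an output file the way a linker might, but propagate errors
 * from fflush() and fclose().  ENOSPC (and deferred EIO) often only
 * appears when stdio's buffer is flushed, so a tool that skips these
 * checks can silently emit a truncated binary. */
int emit_output(const char *path, const void *buf, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;

    if (fwrite(buf, 1, len, f) != len) {  /* may only fill the buffer */
        fclose(f);
        return -1;
    }
    if (fflush(f) == EOF) {               /* ENOSPC can appear here... */
        fclose(f);
        return -1;
    }
    if (fclose(f) == EOF)                 /* ...or here */
        return -1;
    return 0;
}
```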