Killable waits are a matter of pragmatism, and I'd approach the problem with that in mind. Perhaps there are, as you suggest, hundreds of places in XFS where a device failure could theoretically hang a process indefinitely, but how many of them actually trigger in practice?
I'd suggest building an XFS filesystem on an iSCSI disk, and trying two basic scenarios:
1. Run a heavy file-layer benchmark to simulate active use of the disk, then pull the Ethernet cable from the iSCSI device.
2. Idle the filesystem, pull the Ethernet cable, then immediately run 'ls' or 'cat /dev/urandom > hugeTestFile'.
I suspect that these scenarios will repeatedly leave processes stuck in just a handful of waits within XFS. Making just those killable would probably help a lot of administrators, for much less work than your proposed "fundamental design change".
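
To see which waits those are, something like the following could be run while the processes are hung. It is only a rough sketch, assuming a Linux /proc with per-process 'stat' and 'wchan' files; the function names are my own and have nothing to do with XFS. It tallies the wait channel of every process in uninterruptible sleep (state D). Note that wchan reports a single symbol at best (it may be 0 or empty depending on kernel configuration), and the sketch looks only at processes, not at the individual threads under /proc/<pid>/task.

```python
import os
from collections import Counter


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, comm, wchan) for each process in state D under proc_root."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            # comm sits in parentheses and may itself contain spaces or ')',
            # so split on the first '(' and the last ')'.
            lpar = stat.index("(")
            rpar = stat.rindex(")")
            comm = stat[lpar + 1:rpar]
            state = stat[rpar + 1:].split()[0]
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"
        yield int(entry), comm, wchan


def tally_wchans(proc_root="/proc"):
    """Count how many D-state processes are waiting in each wait channel."""
    return Counter(wchan for _pid, _comm, wchan in blocked_tasks(proc_root))


if __name__ == "__main__":
    for wchan, count in tally_wchans().most_common():
        print(f"{count:6d}  {wchan}")
```

If the tally over a few repeated runs of both scenarios keeps showing the same few symbols, those are the waits worth making killable first. Where wchan is not informative, reading /proc/<pid>/stack as root should give the full kernel stack for a stuck process.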