> block numbers will need to change, but I don't see why inode numbers
> would have to change (and if you don't change those, then lots of other
> problems vanish), they are already independent of where on the disk the
> data lives.
Inode numbers in XFS are an encoding of their location on disk. To shrink, you have to physically move inodes and so their number changes.
> this seems fairly obvious to me, what am I missing that makes the
> simple approach of
[snip description of what xfs_fsr does for files]
> not work? (at least for file data)
Moving data and inodes is trivial - most of that is already there with the [almost finished] xfs_reno tool (moves inodes) and the xfs_fsr (moves data) tools. It's all the other corner cases that are complex and very hard to get right.
The "identify something to move" operation is not trivial in the case of random metadata blocks in the regions that will be shrunk. A file may have all it's data in a safe location, but it may have metadata in some place that needs to be moved (e.g. an extent tree block). Same for directories, symlinks, attributes, etc. That currently requires a complete metadata tree walk which is rather expensive. It will be easier and much faster when the reverse mapping tree goes in, though.
The biggest piece of work is metadata relocation. For each different type of metadata that needs to be relocated, the action is different - reallocation of the metadata block and then updating all the sibling, parent and multiple index blocks that point to it is not a simple thing to do. It's easy to get wrong and hard to validate. And there are a lot of different types. e.g. there are 6 different types of metadata blocks with multiply interconnected indexes in the directory structure alone.
> if you try to do this on a live filesystem
If we want it to be a fail-safe operation then it can only be done online. xfs_fsr and xfs_reno already work online and are fail-safe. Essentially, every metadata change must be atomic and recoverable and that means it has to be done through the transaction subsystem. We don't have a transaction subsystem implemented for offline userspace utilities, so a failure during an offline shrink would almost certainly result in a corrupted filesystem or data loss. :(
In case you hadn't guessed by now, one of the reasons we haven't implemented shrinking is that we know *exactly* how complex it actually is to get it right. We're not going to support a half-baked implementation that screws up, so either we do it right the first time or we don't do it at all. But if someone wants to step up to do it right then they'll get all the help they need from me. ;)