I think that's over-complicating it. Any file which has fewer bytes stored than its logical length is "sparse" and necessarily has "holes". A hole is really just a part of the file which is not physically stored as a run of zeros in some filesystem data block.
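As a rough illustration of that definition, here is a minimal C sketch that compares the allocated bytes with the logical length. It assumes st_blocks is counted in 512-byte units (true on Linux, not guaranteed elsewhere), and looks_sparse() is just a name made up for this example. Treat the result as a hint only: things like filesystem compression can also make st_blocks smaller than the length would suggest.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Returns true when fewer bytes are allocated than the logical length.
 * Assumption: st_blocks is in 512-byte units, as on Linux.
 */
bool looks_sparse(const struct stat *st)
{
    long long allocated = (long long)st->st_blocks * 512;
    return allocated < (long long)st->st_size;
}
```

You would call it on the result of stat() or fstat(), e.g. `if (fstat(fd, &st) == 0 && looks_sparse(&st)) ...`.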
The key is that the filesystem has this information *without* having to go searching through data blocks for it. It can therefore relay the hole locations to userspace efficiently, as opposed to the current userspace heuristic methods, which must read() all the data and examine it for (position,length) runs of zeros that they can then lseek() forward over while writing.
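For comparison, the userspace heuristic looks roughly like the sketch below. This is not what any particular tool does; CHUNK, all_zero() and copy_skipping_zeros() are names invented for the example, and it assumes both descriptors start at offset 0 with an empty destination. Note that every byte of the source still gets read, holes included.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

enum { CHUNK = 4096 };  /* hypothetical granularity for zero detection */

static bool all_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return false;
    return true;
}

/* Copy in_fd to out_fd, seeking over all-zero chunks instead of writing them.
 * Returns 0 on success, -1 on error. */
int copy_skipping_zeros(int in_fd, int out_fd)
{
    unsigned char buf[CHUNK];
    off_t total = 0;
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        if (all_zero(buf, (size_t)n)) {
            /* Leave a gap; the filesystem may or may not store it as a hole. */
            if (lseek(out_fd, n, SEEK_CUR) == (off_t)-1)
                return -1;
        } else {
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
                if (w < 0)
                    return -1;
                off += w;
            }
        }
        total += n;
    }
    if (n < 0)
        return -1;
    /* A trailing gap only becomes part of the file once the length is set. */
    return ftruncate(out_fd, total);
}
```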
So, with that:
"is 1000001 a hole?" - No. I'm quite confident no existing filesystem would store that with a hole in it. The zero run is too short. However, a general 100...001 string may, or may not contain a hole, which could potentially (but not necessarily) be reported by lseek(SEEK_HOLE).
"how many 0's need to be in place before it should be identified as a hole?" - it *should* only be identified as a hole if the zero's are not actually stored in data blocks, and thus the file is "sparse". We are not asking the filesystem to analyze the content for potential holes, but just to report what holes it currently has, which should be efficient. In fact, any lseek(SEEK_HOLE) implementation that has to examine data blocks for zero-content should probably be considered broken, imo.
"what if the source is on 4192 byte sector media and the destination is on 512 byte sector media..." - It doesn't matter. Nothing says holes are commutative, they are simply a storage optimization and need not be reproduced exactly. A destination filesystem will automatically convert to the padding and alignment of new holes in its own internal structure if you lseek over zeros (though perhaps not optimally; it depends on how the file is constructed). On vfat, for example, sparse files are not possible, and *every* zero byte will be stored literally, regardless of the sparsity of the original file.
"it seems like it may be better to have a seek_hole() function" - the FIEMAP ioctl still exists, although since it is apparently error prone, the lseek(SEEK_HOLE) interface may produce less buggy client code, and still be hugely efficient on sparse file copies.
"so that you could tell seek_hole() what you consider a hole." - There is no point in telling seek_hole() what you consider a hole, is there? You can already find and try to reproduce all possible holes in a file, in user-space, by read()ing, and then seek()ing over zeros on copies (I'll have to check what madness 'coreutils' uses for "cp --sparse"). The filesystem *can* efficiently tell us what holes it currently has, though, potentially saving a lot of read()s.
So currently, the classic way to create holes in sparse files is by lseek()ing past the end of the data before writing (mmap() also works); it seems somewhat logical to use lseek() to detect those holes as well. It's not all-powerful, but it's reasonably simple.