> Why is requesting data off a file on disk different from requesting a block from a network block device?
Requesting a block from a disk maybe isn't. Inside Linux you call 'submit_bio', and inside the bio is a callback pointer which will eventually get called (theoretically it could be called even before the submit_bio call completes). It is completely asynchronous, and it would be quite easy to plug that into an async user-space interface if you had a good one.
But requesting data off a file is different. The fs might have to load one or more indirect blocks before it even knows where to look for the block. And then (for a more advanced fs), it might want to respond to a read failure in some arbitrarily complex fashion. Obviously this is all possible. The fs could encode that state concerning 'what to do next' in some structure and attach it to the bio that is sent off.
In general you might have arbitrarily many levels, each of which attaches state and a callback pointer to a request and ships it down to the lower level. That state can be arbitrarily complex and will often have 'stage' information for walking through a state machine. What you end up building looks a lot like a traditional call stack with local variables and return addresses. But we already have a well-tested and well-understood mechanism for holding local variables and return addresses. We call it a 'thread'.
So the question is: when is it better to have lots of threads each running traditional single-threaded code, and when is it better to encode state in a separate data structure and use a state machine to process it all 'asynchronously'?
Naturally we have both approaches in the Linux kernel so we can compare.
The examples of state machines that I am aware of are the block layer (submit_bio mentioned above) and the sunrpc layer used by NFS. Obviously the tcp/ip stack would work like a state machine too, but I have no direct familiarity with its complexity. It would have the benefit of being designed and specified as a state machine, so that might help it a little.
The block layer has a single level - the queue. When lower levels want to do something extra, they make another request and put it back on the queue. This leads to a modest amount of complexity in the various flags on requests that say how much the queue handlers are allowed to re-order them. It is manageable, but it isn't simple.
The sunrpc layer has wheels-within-wheels. Much like the 'indirect block' lookup mentioned earlier, sunrpc might need to perform a portmap lookup to find out which port to send a request to, and might need to try again after a timeout. I think the sunrpc code is reasonably well structured, in that it isn't too hard to get an overview of how things flow around. But I find that going much deeper gets really hard. Following error statuses around is a particular challenge.
If these two are good examples, I would suggest that these state-machine-with-explicit-state approaches should be avoided where possible. I think it cannot be avoided in the block layer. I am less convinced about sunrpc, though at the time it was written it may well have been the best option.
I would be very hesitant about suggesting that filesystems be re-written to be completely asynchronous. It would be much more robust to find a way to fork a blocked task off into a separate thread.
Then an AIO request would either return a status or a pid, and as you suggested, reading from some fd would return the pids of the tasks as they complete. 'kill' could be used as a cancellation interface. This has all largely been suggested already on lkml. I seem to recall there were some uncertainties about making sure the user-space side always saw the same pid. So maybe some issues never got ironed out and nobody put the time in to implementing it (that I am aware of).