> Why is requesting data off a file on disk different from requesting a block from a network block device?
Requesting a block from a disk maybe isn't. Inside Linux you call 'submit_bio' and inside the bio is a callback pointer which will eventually get called (theoretically it could be called even before the submit_bio call completes). It is completely async and it would be quiet easy to plug that into an async user-space interface if you had a good one.
But requesting data off a file is different. The fs might have to load one or more indirect blocks before it even knows where to look for the block. And then (for a more advanced fs), it might want to respond to a read failure in some arbitrarily complex fashion. Obviously this is all possible. The fs could encode that state concerning 'what to do next' in some structure and attach it to the bio that is sent off.
In generally you might have arbitrarily many levels each of which attaches state and a call-back pointer to a request and ships it down to the lower level, and that state can be arbitrarily complex and will often have 'stage' information for walking through a state-machine. What you end up building looks a lot like a traditional call stack with local variable and return addresses. But we already have a well-tested and well-understood mechanism for holding local variable and return addresses. We call it a 'thread'.
So the question is: when is it better to have lots of threads each running traditional single-threaded code, and when it is better to encode state in a separate data structure and use a state machine to process it all 'asynchronously'.
Naturally we have both approaches in the Linux kernel so we can compare.
The examples of state-machines that I am aware of are the block layer (submit_bio mentioned above) and the sunrpc layer used by NFS. Obviously the tcp/ip stack would work like a state machine, but I have no directly familiarity with its complexity. It would have the benefit of being designed and specified as a state machine, so that might help it a little.
The block layer has a single level - the queue. When lower-levels want to do something extra, they make another request and put it back on the queue. This leads to a modest amount of complexity in different flags on requests saying different things about how much the queue handlers are allowed to re-order them. It is manageable, but it isn't simple.
The sunrpc layer has wheels-within-wheels. Much like the 'indirect block' lookup mentioned earlier, sunrpc might need to perform a portmap lookup to find out which port to send a request to, and might need to try again after a timeout. I think the sunrpc code is reasonably well structured in that is isn't too hard to get an over-view of how things flow around. But I find that going much deeper gets really hard. Following error statuses around is particularly a challenge.
If these two are good examples, I would suggest that these state-machine-with-explicit-state approaches should be avoided where possible. I think it cannot be avoided in the block layer. I am less convinced about sunrpc, though at the time it was written it may well have been the best option.
I would be very hesitant about suggesting that filesystems be re-written to be completely asynchronous. It would be much more robust to find a way to fork a blocked task of into a separate thread.
Then an AIO request would either return a status or a pid, and as you suggested, reading from some fd would return the pids of the tasks as they complete. 'kill' could be used as a cancellation interface. This has all largely been suggested already on lkml. I seem to recall there were some uncertainties about making sure the user-space side always had the same pid. So maybe some issues never got ironed out and nobody put the time in to implementing it (that I am aware of).
Posted Nov 25, 2010 17:39 UTC (Thu) by kleptog (subscriber, #1183)
[Link]
Thanks a lot for that explanation, I see the difficulties now.
State machines are nice and efficient, but hard to understand and program. What are probably the most common large state machines, parsers, are often generated by other programs (flex & bison) from a higher level description, whose output no human can understand directly.
I don't think anyone has yet defined a language for filesystems that could be used to generate an appropriate state machine, perhaps the state of the art just isn't there yet.
Async I/O
Posted Nov 27, 2010 2:34 UTC (Sat) by skissane (subscriber, #38675)
[Link]
Couldn't this be a good case for writing a kernel in a language that had coroutines or continuations? One can write the code in the easier to follow procedural style, and then make it asynchronous just by adding "yield" or "call/cc" at the appropriate points. I wonder how hard it would be to come up with a version of C with coroutines, and whether that could be used with the Linux kernel...
Is there some way you could take the synchronous implementation of a driver or filesystem or other kernel component, determine the blocking points (or just require them to be annotated), and then automatically generate a state-machine-based asynchronous implementation from the synchronous one? A single source code for both synch and asynch implementations would make one more confident they were both correct.
Async I/O
Posted Nov 27, 2010 3:06 UTC (Sat) by foom (subscriber, #14868)
[Link]
But why bother? Kernel threads *are* such a state machine, just a very convenient form. What good would reinventing threads do?