So, then, the way to get fast I/O is to issue asynchronous I/O calls from a single thread (so that the scheduler sees all the requests coming from one process and has no fairness to enforce) rather than spawning multiple threads or processes.
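To make the single-thread idea concrete, here is a rough sketch, not a definitive implementation. As far as I know the Python standard library does not expose Linux native AIO, so this swaps in a related technique: readahead hints via `os.posix_fadvise` with `POSIX_FADV_WILLNEED`, followed by ordinary `os.pread` calls. The function name `read_ranges` and its `(offset, length)` argument format are made up for illustration, and it assumes a Unix system where `posix_fadvise` is available.

```python
import os


def read_ranges(path, ranges):
    """Read (offset, length) ranges of one file from a single thread.

    Hypothetical helper: every range is announced to the kernel before any
    data is read, so all requests are outstanding from one process at once.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Phase 1: hint every range up front. POSIX_FADV_WILLNEED is only
        # advisory; the kernel may begin readahead without blocking us.
        for offset, length in ranges:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

        # Phase 2: collect the data. pread returns fewer bytes than asked
        # for when the range runs past the end of the file.
        return [os.pread(fd, length, offset) for offset, length in ranges]
    finally:
        os.close(fd)
```

Calling `read_ranges("data.bin", [(0, 4096), (1 << 20, 4096)])` (with a hypothetical `data.bin`) returns the two chunks in request order. The hints are advisory, so the kernel is free to ignore them, and this says nothing about how any particular scheduler will order the resulting requests.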
Is there any way to fork subprocesses but still let CFQ know that they're all related and happy to altruistically share I/O bandwidth between them, so it doesn't try to divide disk time fairly among them at the expense of total throughput?