Disk I/O priorities
Jens Axboe has now taken a stab at the I/O priority issue with a new version of his "completely fair queueing" (CFQ) I/O scheduler. We first mentioned the CFQ scheduler back in February; it works by creating a separate request queue for every process issuing disk I/O and taking an equal number of requests from each one of them. In this way, it seeks to distribute the available I/O bandwidth equally across processes in the system and produce "completely fair" results.
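The basic CFQ idea described above can be illustrated with a small simulation. This is only a sketch, not the kernel's code: the function name `cfq_dispatch` and the `per_round` parameter are made up for illustration, and requests are plain values rather than real block requests.

```python
from collections import OrderedDict, deque


def cfq_dispatch(queues, per_round=1):
    """Drain per-process queues round-robin.

    queues: dict mapping a process id to a list of its pending requests.
    per_round: hypothetical knob, how many requests each process may
               contribute per pass.
    Returns the order in which requests would be dispatched.
    """
    pending = OrderedDict((pid, deque(reqs)) for pid, reqs in queues.items())
    order = []
    while pending:
        # Visit every process that still has work, taking the same
        # number of requests from each one.
        for pid in list(pending):
            q = pending[pid]
            for _ in range(min(per_round, len(q))):
                order.append((pid, q.popleft()))
            if not q:
                del pending[pid]
    return order
```

With one busy process and one light one, the light process is not starved: its single request goes out right after the first request of the busy process.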
The new version gives each process an I/O priority, which is a number between zero and 20 (inclusive). At the bottom end (priority zero), disk I/O is allowed only when the disk would otherwise be idle. A priority of 20, by contrast, is the "real-time" level; all requests at that level are satisfied before any other requests are considered. The levels in between are for normal processes; by default, the I/O priority is set to 10. A pair of system calls has been added to adjust the I/O priority of a process, though the form of those calls is likely to change in the future.
Internally, the per-process request queues have now been divided into an array of 21 lists, one for each priority level. There is also a dispatch queue, which holds the requests that have been selected for processing next. A separate dispatch queue is still needed to allow some amount of request ordering and merging.
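As a rough picture of that layout, the following sketch models the 21 priority lists and the dispatch queue. The class name `PrioQueues`, the constant names, and the `enqueue` method are all hypothetical; the actual kernel structures are not shown in this article.

```python
NUM_PRIOS = 21          # priorities 0..20, inclusive
IOPRIO_IDLE = 0         # only served when the disk is otherwise idle
IOPRIO_DEFAULT = 10     # default for normal processes
IOPRIO_RT = 20          # "real-time" level


class PrioQueues:
    """Hypothetical model: one list of per-process queues per priority."""

    def __init__(self):
        # levels[prio] maps a process id to that process's pending requests.
        self.levels = [dict() for _ in range(NUM_PRIOS)]
        # Requests already selected for processing next.
        self.dispatch = []

    def enqueue(self, pid, request, prio=IOPRIO_DEFAULT):
        if not IOPRIO_IDLE <= prio <= IOPRIO_RT:
            raise ValueError("I/O priority must be between 0 and 20")
        self.levels[prio].setdefault(pid, []).append(request)
```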
When the time comes to fill the dispatch queue, the new scheduler starts with the real-time queue. If requests are waiting there, they go straight into the dispatch queue and the fill operation is complete. There is also an anticipatory scheduling feature for real-time requests: when the last real-time request is processed, the scheduler will wait a short period (10ms, currently) to see if any more real-time requests show up before opening the floodgates for everybody else.
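The real-time decision can be sketched as a small function. This is an illustration under stated assumptions, not the scheduler's real logic: `next_action`, its arguments, and the returned strings are invented names, and time is passed in explicitly as milliseconds.

```python
RT_ANTICIPATION_MS = 10  # wait period mentioned in the text


def next_action(rt_pending, now_ms, last_rt_done_ms):
    """Decide what to do when the dispatch queue needs filling.

    rt_pending: number of real-time requests currently waiting.
    now_ms: current time in milliseconds.
    last_rt_done_ms: when the last real-time request completed,
                     or None if there has not been one.
    """
    if rt_pending > 0:
        # Real-time requests always go first.
        return "dispatch-rt"
    if (last_rt_done_ms is not None
            and now_ms - last_rt_done_ms < RT_ANTICIPATION_MS):
        # Hold everybody else back briefly in case more
        # real-time requests are about to arrive.
        return "wait"
    return "dispatch-others"
```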
In the absence of real-time requests, the code passes through each priority level, taking a decreasing number of requests from each one. Each process gets to contribute one request at a time to the dispatch queue until the quota for its priority level (expressed in both the number of requests and the number of sectors to transfer) has been reached. Requests are only taken from the idle priority queue if no other requests have been dispatched for a configurable period of time (default 100ms).
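A simplified model of that pass follows. The quota formula in `level_quota` is purely an assumption (the article does not give the real numbers), as is the reading that the quota applies to a priority level as a whole; the function names and data layout are likewise hypothetical.

```python
from collections import deque

IDLE_GRACE_MS = 100  # default quiet period before idle requests run


def level_quota(prio):
    """Hypothetical quota: (max requests, max sectors) for a level.

    Higher priorities get more; the real values are not given here.
    """
    return prio, prio * 64


def fill_from_level(prio, procs):
    """procs: dict of pid -> deque of request sizes in sectors."""
    max_reqs, max_sectors = level_quota(prio)
    taken, sectors = [], 0
    progress = True
    while progress and len(taken) < max_reqs and sectors < max_sectors:
        progress = False
        for pid, q in procs.items():
            if len(taken) >= max_reqs or sectors >= max_sectors:
                break
            if q:
                # One request per process per turn; the sector quota
                # may be overshot by the final request.
                size = q.popleft()
                taken.append((pid, size))
                sectors += size
                progress = True
    return taken


def fill_dispatch(levels, ms_since_last_dispatch):
    """levels: dict of prio -> procs. Returns (prio, pid, sectors) tuples."""
    dispatch = []
    for prio in range(19, 0, -1):  # normal levels, highest first
        for pid, size in fill_from_level(prio, levels.get(prio, {})):
            dispatch.append((prio, pid, size))
    # Idle requests only run after a sufficiently long quiet period.
    if not dispatch and ms_since_last_dispatch >= IDLE_GRACE_MS:
        for pid, q in levels.get(0, {}).items():
            if q:
                dispatch.append((0, pid, q.popleft()))
    return dispatch
```

In this model a level-10 queue can contribute up to ten requests per pass while a level-2 queue is cut off after two, which is one way to get the "decreasing number of requests" behavior.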
With the new CFQ scheduler, an I/O request may not be serviced even after it makes it into the dispatch queue. If a new request with real-time priority shows up, all lower-priority requests are yanked back out of the dispatch queue and have to go through the whole process again. Similarly, any non-idle requests will cause any pending idle-priority requests to lose their place in the dispatch queue.
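The eviction rule can be expressed compactly. The sketch below assumes a dispatch queue represented as a list of `(priority, name)` tuples; `preempt` is a made-up helper, and where the evicted requests go afterward (back to their per-process queues) is left to the caller.

```python
IOPRIO_IDLE = 0
IOPRIO_RT = 20


def preempt(dispatch, new_prio):
    """Split the dispatch queue when a request of new_prio arrives.

    Returns (kept, requeued): requests that stay in the dispatch queue
    and requests that must go through selection again.
    """
    if new_prio == IOPRIO_RT:
        # A real-time arrival pushes out everything below real-time.
        def evicted(prio):
            return prio < IOPRIO_RT
    elif new_prio > IOPRIO_IDLE:
        # Any non-idle arrival pushes out pending idle requests.
        def evicted(prio):
            return prio == IOPRIO_IDLE
    else:
        # An idle arrival displaces nothing.
        def evicted(prio):
            return False

    kept = [r for r in dispatch if not evicted(r[0])]
    requeued = [r for r in dispatch if evicted(r[0])]
    return kept, requeued
```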
The new scheduler appears to be uncontroversial - though it clearly is not a critical fix and thus won't go into 2.6.0. The real debate appears to be over how I/O priorities should be controlled. Some commenters would like to see the nice() system call apply to I/O priorities as well as CPU priorities. That, however, would be a fairly fundamental ABI change, and is unlikely to happen.
