One of the results of all the recent VM improvements in the 2.5 tree is
that the shortcomings of the current block I/O scheduler are becoming more
apparent. The scheduler (or "elevator") tries to join adjacent requests
together, and to sort requests along the disk in order to minimize seeks
and optimize I/O
performance. Unfortunately, the current scheduler is not all that good at
treating requests fairly. In the name of overall performance, the
scheduler can cause the starvation of individual requests for a very long
time. Requests eventually get serviced, but if the starved request is
satisfying a page fault in the X server, the user can get seriously annoyed
in the meantime.
The scheduler shortcomings have always been there, but they are more
apparent now because the VM subsystem does a more effective job of filling
the disk request queues. Now that the scheduler problems can't hide behind
VM problems, they need to be addressed. To that end, Jens Axboe has dusted
off and fixed up his deadline I/O scheduler
patch, and is looking for people to try it out and report on how it works.
His initial results are good:
The Andrew Morton Interactive Workload (AMIW) rates the current
kernel poorly, on my test machine it completes in 1-2 minutes
depending on your luck. 2.5.38-BK does a lot better, but mainly
because it's being extremely unfair. This deadline io scheduler
finishes the AMIW in anywhere from ~0.5 seconds to ~3-4 seconds,
depending on the io load.
The "AMIW" works, essentially, by creating a great deal of disk write
traffic, then seeing how long it takes to list out a large directory while
the writes are being handled.
The idea behind the deadline scheduler is that all read requests get
satisfied within a specified time period - 1/2 second by default.
(Write requests do not have specific deadlines; processes usually cannot
continue until their reads are performed, but writes can generally be
deferred indefinitely.) This deadline behavior is achieved by sorting
I/O requests into several different queues, sketched in code after the list:
- The sorted queues (sort_list) contain I/O requests sorted
by the elevator for best disk performance. There are two queues, one
for read requests, and one for writes.
- The FIFO queue (read_fifo) contains all read requests. Since
it's a FIFO, the
request at the head of the queue will be the oldest, and will be the
one whose deadline expires first.
- The dispatch queue contains requests which are ready to be
passed down to the block driver. Requests do not go into the dispatch
queue until the scheduler decides that it is time to execute them.
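To make that structure concrete, here is a rough sketch, in C, of the
bookkeeping involved. The names and layout are illustrative only; the actual
patch builds on the kernel's existing struct request and list handling. Note
that a request can sit on a sorted queue and the read FIFO at the same time,
which is why this model gives it two link fields.

    /* A simplified model of the scheduler's queues; all names here
     * are illustrative, not the actual definitions from the patch. */
    struct request {
        unsigned long sector;       /* starting sector of the transfer */
        int is_read;                /* read (1) or write (0) */
        unsigned long deadline;     /* expiry time, reads only */
        struct request *next_sort;  /* linkage in a sorted queue */
        struct request *next_fifo;  /* linkage in the read FIFO */
    };

    struct deadline_queues {
        struct request *sort_list[2];  /* sector-sorted; [0] writes, [1] reads */
        struct request *read_fifo;     /* arrival order; head is oldest */
        struct request *dispatch;      /* ready for the block driver */
    };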
With these lists, the operation of the scheduler is not that hard.
Incoming requests are merged into the proper sort_list in the
usual way: they are sorted by sector number, and merged with adjacent
requests. (The real operation is complicated by barrier operations and a
hashing mechanism designed to improve the performance of the scheduler
itself, but the basic idea is as described.) Read requests are also given
a deadline time stamp and put on the end of the read_fifo list.
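In code, that insertion path might look something like the sketch below; HZ
stands in for the kernel's clock-tick rate, and the merging and hashing steps
are left out entirely.

    #define HZ          1000        /* clock ticks per second; illustrative */
    #define READ_EXPIRE (HZ / 2)    /* default read deadline: half a second */

    /* Hypothetical insertion path: sort the request into place, then
     * stamp reads with a deadline and append them to the FIFO. */
    static void deadline_add_request(struct deadline_queues *dq,
                                     struct request *rq, unsigned long now)
    {
        struct request **p = &dq->sort_list[rq->is_read];

        /* Find the spot that keeps the queue sorted by sector. */
        while (*p && (*p)->sector < rq->sector)
            p = &(*p)->next_sort;
        rq->next_sort = *p;
        *p = rq;

        if (rq->is_read) {
            rq->deadline = now + READ_EXPIRE;  /* must be serviced by then */
            rq->next_fifo = NULL;
            /* Append at the tail; the head thus remains the oldest read. */
            p = &dq->read_fifo;
            while (*p)
                p = &(*p)->next_fifo;
            *p = rq;
        }
    }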
When the block driver is ready to start another disk I/O request, the core
algorithm of the deadline scheduler comes into play. In simplified form,
it works like this (a code sketch follows the list):
- If there are requests waiting in the dispatch queue then the scheduler
need do no work, since it has already decided what it wants to do next.
- Otherwise it's time to move a new set of requests to the dispatch
queue. The scheduler looks in the following places, taking requests
from (only) the first source that it finds:
- If there are pending write requests, and the scheduler has not
selected any writes for "a long time," a set of write requests is
moved over.
- If there are expired read requests in the read_fifo
list, take a set from there.
- If there are pending reads in the sorted queue, some of them are
moved.
- Finally, if there are pending write operations, the dispatch
queue is filled from the sorted write queue.
- Once that's done, if there are any requests in the dispatch queue
(i.e. if there's anything for the disk to do at all) the first one is
passed back to the driver.
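Here is one way that decision sequence could be expressed, continuing the
sketch above. move_batch() and pop_dispatch() are hypothetical helpers - the
first moves a batch of requests from the given queue onto the dispatch queue
(see the next sketch), the second detaches and returns the head of the
dispatch queue - and the real patch is more careful, for instance jumping to
an expired read's position in the sorted queue.

    #define WRITES_STARVED 2  /* read batches allowed between write batches */

    /* A condensed model of the dispatch decision described above. */
    static struct request *deadline_next_request(struct deadline_queues *dq,
                                                 unsigned long now)
    {
        static int starved;  /* read batches since the last write batch */

        if (!dq->dispatch) {
            if (dq->sort_list[0] && starved >= WRITES_STARVED) {
                move_batch(dq, 0);  /* writes have waited long enough */
                starved = 0;
            } else if (dq->read_fifo && dq->read_fifo->deadline <= now) {
                move_batch(dq, 1);  /* a read deadline has expired */
                starved++;
            } else if (dq->sort_list[1]) {
                move_batch(dq, 1);  /* ordinary reads, in sorted order */
                starved++;
            } else if (dq->sort_list[0]) {
                move_batch(dq, 0);  /* nothing else to do: send writes */
                starved = 0;
            }
        }
        return pop_dispatch(dq);  /* NULL if there is no work at all */
    }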
The definition of "a long time" for write request starvation is simply two
iterations of the scheduler algorithm. After two sets of read requests
have been moved to the dispatch queue, the scheduler will send over a set
of writes. A "set," by default, can be as many as 64 contiguous requests -
but a request that requires a disk seek counts the same as 16 contiguous
requests.
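The batch accounting could be modeled along these lines; FIFO_BATCH and
SEEK_COST are simply this sketch's names for the 64-request and 16-request
figures above, requests are treated as one sector long for simplicity, and
dispatch_append() is another hypothetical helper.

    #define FIFO_BATCH 64  /* a batch is worth up to 64 contiguous requests */
    #define SEEK_COST  16  /* a request forcing a seek is charged as 16 */

    /* Move up to one batch's worth of requests from a sorted queue
     * (dir: 0 = writes, 1 = reads) onto the dispatch queue. */
    static void move_batch(struct deadline_queues *dq, int dir)
    {
        int budget = FIFO_BATCH;
        unsigned long next_sector = (unsigned long)-1;
        struct request *rq;

        while (budget > 0 && (rq = dq->sort_list[dir]) != NULL) {
            /* Contiguous with the previous request: cheap.  A seek: 16x. */
            budget -= (rq->sector == next_sector) ? 1 : SEEK_COST;
            next_sector = rq->sector + 1;
            dq->sort_list[dir] = rq->next_sort;
            dispatch_append(dq, rq);
        }
    }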
There has been no word from Linus on this patch; barring problems, however,
it would be surprising if some form of it were not merged in the near
future.