Gerrit Huizenga raised this issue with the developers: why do filesystems,
when performing sequential I/O, only get about 60% of the performance of
raw I/O? The folks at IBM did some research, and came to the conclusion
that filesystem I/O is being broken into chunks that are too small. Even
when the filesystem succeeds in laying out a file's blocks contiguously on
the disk (as should happen most of the time), the actual I/O operations
tend to get broken up.
What it comes down to, it seems, is that memory pressure can force the
system to make poor choices. When memory is tight, the VM subsystem will
force out pages wherever it can find them, without really considering whether
there are nearby (on-disk) pages also waiting to be written out. The
solution, according to Gerrit, is to use page clustering
for file I/O. If file I/O operations are done in larger chunks, better
performance will result.
Linus asked whether a full-blown page clustering mechanism is really
necessary. One could, instead, modify the page writeout code to look for
adjacent pages that could go out at the same time. Of course, one could
call that approach I/O clustering under a different name. One way or
another, the 2.7 kernel will probably do better in this regard.
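Neither the proposed page clustering code nor Linus's suggested tweak to the
writeout path appears in the discussion, but the underlying idea is simple:
when writing out a dirty page, check whether its neighbors are dirty too, and
submit the whole contiguous run as one larger I/O. What follows is a minimal
userspace sketch of that idea, not actual kernel code; the dirty-page bitmap,
the file name, and the writeback_clustered() helper are all illustrative.

    /*
     * Userspace analogy of clustered writeback (not kernel code).
     * A toy dirty-page bitmap stands in for the page cache; the
     * clustered writer coalesces runs of adjacent dirty pages into a
     * single pwrite() instead of issuing one small write per page.
     */
    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096
    #define NR_PAGES  32                /* size of the toy file, in pages */

    static bool dirty[NR_PAGES];        /* which pages need writing out */
    static char data[NR_PAGES][PAGE_SIZE];

    /* Write out all dirty pages, clustering adjacent ones into one I/O. */
    static void writeback_clustered(int fd)
    {
        for (int i = 0; i < NR_PAGES; ) {
            if (!dirty[i]) {
                i++;
                continue;
            }
            /* Extend the run as long as the next page is also dirty. */
            int start = i;
            while (i < NR_PAGES && dirty[i])
                i++;
            int npages = i - start;

            /* One large, contiguous write instead of npages small ones. */
            if (pwrite(fd, data[start],
                       (size_t)npages * PAGE_SIZE,
                       (off_t)start * PAGE_SIZE) < 0) {
                perror("pwrite");
                exit(EXIT_FAILURE);
            }
            memset(&dirty[start], 0, npages * sizeof(bool));
            printf("wrote pages %d-%d in a single %d-page I/O\n",
                   start, i - 1, npages);
        }
    }

    int main(void)
    {
        int fd = open("clustered-demo.dat",
                      O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Dirty a contiguous run of pages, as sequential file I/O would. */
        for (int i = 4; i < 12; i++) {
            memset(data[i], 'x', PAGE_SIZE);
            dirty[i] = true;
        }

        writeback_clustered(fd);        /* one 8-page write, not eight */
        close(fd);
        return EXIT_SUCCESS;
    }

Run against the toy file above, the clustered path issues a single eight-page
write where a naive per-page loop would issue eight 4KB writes; in the kernel,
the same coalescing would happen against the block layer rather than through
pwrite().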