
I/O scheduler performance

Posted Aug 23, 2010 15:30 UTC (Mon) by epa (subscriber, #39769)
In reply to: A systemd status update by mezcalero
Parent article: A systemd status update

>>one thing we learned is that maximizing parallelization the way we do has little benefit on the disk elevator on rotating media. Naive people like me assumed that providing the current Linux IO scheduler with a larger amount of requests at the same time it can choose from would improve its performance. Turns out it currently doesn't really.<<
This is backed up by folk wisdom about how to get fast I/O on Unix-like systems. Do we run multiple cp commands in parallel to give the I/O scheduler more choice? When browsing a directory of image files, does the thumbnail viewer try to open them all simultaneously? We all know that this will tend to make things slower, not faster.

So has research in disk elevator algorithms reached the point where it's possible to do better - to throw large numbers of requests at the system and have it respond to them faster than it would if given one at a time? Or are we stuck (on rotating media at least) with the practical reality that most of the time, requests to the same disk are best made one at a time?



I/O scheduler performance

Posted Aug 23, 2010 16:31 UTC (Mon) by dlang (✭ supporter ✭, #313)

the problem is that the system has no way of knowing when you submit all this I/O if you mean

do all of these, and minimize the overall time

or

I need these to all make progress at the same time, even if it means taking longer overall.

current algorithms tend to assume the second: they try to split the available I/O bandwidth between all the requests, and since this results in lots of seeks, it hurts on traditional media under massively parallel requests

a small amount of parallelism helps by giving the drive something to do when it would otherwise be idle; however, once you pass the saturation point it hurts, because the system adds seeks as it jumps from one set of requests to the next.

this is the same sort of thing that makes hyperthreading anywhere from a noticeable benefit to a mild loss, depending on the workload
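
The seek cost described above can be illustrated with a toy model. This is only a sketch: it treats the disk as a line of block numbers and counts head travel, ignoring rotational latency, caching and everything a real elevator does. The function names `total_seek_distance` and `build_schedule` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sum of absolute head movements when the requests in blocks[] are
 * served in the given order, with the head starting at `start`. */
long long total_seek_distance(const long long *blocks, size_t n, long long start)
{
    long long pos = start, total = 0;
    for (size_t i = 0; i < n; i++) {
        total += llabs(blocks[i] - pos);
        pos = blocks[i];
    }
    return total;
}

/* Fill out[] (room for streams * per_stream entries) with the requests of
 * `streams` sequential readers. Reader s reads per_stream consecutive
 * blocks starting at block s * gap.
 * interleaved != 0: round-robin between readers (equal progress).
 * interleaved == 0: finish one reader before starting the next. */
void build_schedule(long long *out, size_t streams, size_t per_stream,
                    long long gap, int interleaved)
{
    size_t k = 0;
    assert(out != NULL);
    if (interleaved) {
        for (size_t i = 0; i < per_stream; i++)
            for (size_t s = 0; s < streams; s++)
                out[k++] = (long long)s * gap + (long long)i;
    } else {
        for (size_t s = 0; s < streams; s++)
            for (size_t i = 0; i < per_stream; i++)
                out[k++] = (long long)s * gap + (long long)i;
    }
}
```

With 2 readers of 3 blocks each, placed 1000 blocks apart and the head starting at block 0, the one-reader-at-a-time order travels 1002 blocks, while the round-robin order travels 4998. The gap between the two grows with the number of readers and the distance between their files.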

I/O scheduler performance

Posted Aug 23, 2010 20:04 UTC (Mon) by axboe (subscriber, #904)

A scheduler like CFQ will attempt to provide a mix of what you describe, depending on how you submit it. If the submission is done from one process, it will assume that you want it to be done as fast as possible. It'll be sorted accordingly. If done from multiple processes or threads, it will attempt to provide equal progress while preserving overall throughput.

What you describe is true of classical work-conserving IO schedulers; it's not the case for the default Linux IO scheduler.

I/O scheduler performance

Posted Sep 8, 2010 11:15 UTC (Wed) by epa (subscriber, #39769)

So, then, the way to get fast I/O is to make asynchronous I/O calls from a single thread (so that the scheduler knows that fairness doesn't matter) rather than spawning multiple threads or processes.

Is there any way to fork subprocesses but still let CFQ know that they're all related and happy to altruistically share I/O bandwidth between them, so it doesn't try to slice up I/O requests fairly at the expense of total throughput?
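
One way to approximate the single-process approach is to queue readahead hints for every file from one process with posix_fadvise(POSIX_FADV_WILLNEED), and only then read the files in order. This is a minimal sketch, not a tested recipe: the helper name `prefetch_files` is invented here, and whether CFQ really treats the resulting requests as one queue is an assumption based on the description above, not something this code verifies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() under strict modes */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Ask the kernel to start reading each file into the page cache.
 * The hint does not wait for the data, so all requests are queued
 * from this one process before any of them is consumed.
 * Returns the number of files for which the hint was accepted. */
int prefetch_files(const char *const *paths, size_t n)
{
    int queued = 0;
    for (size_t i = 0; i < n; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0)
            continue;               /* skip files we cannot open */
        /* offset 0 with length 0 means "from the start to end of file" */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0)
            queued++;
        close(fd);                  /* the hint was already submitted */
    }
    return queued;
}
```

A caller would pass the list of files it is about to read, then open and read them one after another as usual. The hint is advisory, so the kernel is free to ignore it.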

I/O scheduler performance: not good enough!

Posted Aug 23, 2010 23:15 UTC (Mon) by renox (subscriber, #23785)

I remember being very impressed by a paper in which userspace applications schedule their own I/O by knowing (an approximation of) the block numbers of the files they use:
>>As an example, a tar of the Linux kernel tree was 82.5 seconds using GNU tar, while our modified tar completed in 17.9 seconds.<<

The paper is here:
http://simula.no/research/nd/publications/Simula.ND.399/s...

I wonder if there will eventually be a system call to find out the block numbers of a file?

I/O scheduler performance: not good enough!

Posted Aug 24, 2010 0:12 UTC (Tue) by wmf (guest, #33791)

The FIEMAP ioctl does this, but in general apps probably shouldn't try to implement such optimizations; Linux isn't an exokernel.

I/O scheduler performance: not good enough!

Posted Aug 24, 2010 21:53 UTC (Tue) by mhelsley (subscriber, #11324)

I don't think FIEMAP provides block numbers. Block numbers would be unique on the partition/device whereas I think FIEMAP provides "extents" (and FIBMAP provides bits) which effectively describe offsets within the file -- not block numbers.

http://lwn.net/Articles/260803/

I/O scheduler performance: not good enough!

Posted Aug 25, 2010 11:28 UTC (Wed) by etienne (subscriber, #25256)

Both FIEMAP and FIBMAP report block numbers, i.e. the position of the file on the device.
You have to work out for yourself the position of that device on the underlying hardware: simple if they are the same or if there is a simple partition table, difficult if there is LVM or MD (RAID) in between.
FIEMAP is a lot quicker than FIBMAP, which is noticeable when getting the mapping of ISO file images.
As an example:

$ wget http://www.mirrorservice.org/sites/download.sourceforge.n...
$ gcc -O2 showmap.c -o showmap
$ ./showmap ./showmap.c
File "./showmap.c" of size 15013 (32 blocks512) is on filesystem 0x302.
According to /proc/diskstats, file './showmap.c' is on device '/dev/hda2'
/dev/hda2: Permission denied
$ su
Password:
# ./showmap ./showmap.c
File "./showmap.c" of size 15013 (32 blocks512) is on filesystem 0x302.
According to /proc/diskstats, file './showmap.c' is on device '/dev/hda2'
Device block size: 512, FS block size: 4096, device size: 61432560 blocks
Device length: 31453470720 bytes
The device start at 61432560 sectors, C/H/S: 65535/16/63.
First FIEMAP says 1 extents
second FIEMAP success, 1 extents filled:
0: logical offset 0 (0 * 4096), physical offset 25904218112 (6324272 * 4096),
length 16384 (4 * 4096) flags 0x1
flags meaning: FIEMAP_EXTENT_LAST 0x1, FIEMAP_EXTENT_UNKNOWN 0x2,
FIEMAP_EXTENT_DELALLOC 0x4, FIEMAP_EXTENT_ENCODED 0x8,
FIEMAP_EXTENT_DATA_ENCRYPTED 0x0, FIEMAP_EXTENT_NOT_ALIGNED 0x100,
FIEMAP_EXTENT_DATA_INLINE 0x200, FIEMAP_EXTENT_DATA_TAIL 0x400,
FIEMAP_EXTENT_UNWRITTEN 0x800, FIEMAP_EXTENT_MERGED 0x1000.
FIGETBSZ: block size 4096 bytes
File (4 blocks of 4096) start at block 6324272 for 4 blocks,
last block 6324275 and file has 1 fragments.
FIBMAP succeeded after end of file, block index 4 give block 0
#
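
For readers who want to see the ioctl itself, here is a minimal sketch of a FIEMAP query. It is not taken from showmap.c. The helper names `bytes_to_fs_block`, `fs_block_to_disk_sector` and `print_extents`, and the cap of 32 extents, are choices made for this illustration. The sector conversion assumes a simple partition table with no LVM or MD in between.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, struct fiemap_extent */

#define MAX_EXTENTS 32     /* arbitrary cap for this sketch */

/* FIEMAP reports physical offsets in bytes; divide by the filesystem
 * block size to get a block number like the one showmap prints. */
uint64_t bytes_to_fs_block(uint64_t physical_bytes, uint32_t fs_block_size)
{
    assert(fs_block_size != 0);
    return physical_bytes / fs_block_size;
}

/* Assumes the filesystem sits directly on a partition that starts at
 * partition_start_sector on the disk (no LVM or MD layer). */
uint64_t fs_block_to_disk_sector(uint64_t partition_start_sector,
                                 uint64_t fs_block,
                                 uint32_t fs_block_size, uint32_t sector_size)
{
    assert(sector_size != 0 && fs_block_size % sector_size == 0);
    return partition_start_sector + fs_block * (fs_block_size / sector_size);
}

/* Print up to MAX_EXTENTS extents of `path`.
 * Returns the number of extents reported, or -1 on any failure
 * (including filesystems that do not support FIEMAP). */
int print_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct fiemap *fm = calloc(1, sizeof(*fm) +
                               MAX_EXTENTS * sizeof(struct fiemap_extent));
    if (fm == NULL) {
        close(fd);
        return -1;
    }
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush delayed allocation first */
    fm->fm_extent_count = MAX_EXTENTS;

    int n = -1;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
        n = (int)fm->fm_mapped_extents;
        for (int i = 0; i < n; i++) {
            const struct fiemap_extent *e = &fm->fm_extents[i];
            printf("%d: logical %llu physical %llu length %llu flags 0x%x\n",
                   i, (unsigned long long)e->fe_logical,
                   (unsigned long long)e->fe_physical,
                   (unsigned long long)e->fe_length, (unsigned)e->fe_flags);
        }
    }
    free(fm);
    close(fd);
    return n;
}
```

Applied to the transcript above, bytes_to_fs_block(25904218112, 4096) gives 6324272, the starting block that showmap reports.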

I/O scheduler performance: not good enough!

Posted Aug 26, 2010 12:48 UTC (Thu) by renox (subscriber, #23785)

Interesting, thanks.
It didn't work on my computer though:
>>
./showmap ./showmap
Boot record of size 512 bytes read successfully from './showmap'
First bytes 0x7F 0x45 0x4C, signature 0x804, FAT16 sig 0x0, FAT32 sig 0x4,
No FAT signature recognised, cannot analyse header.
Partition table: WindowsNTmarker 0x3C, Unknown 0x0, signature 0x804
0: indicator 0x0 i.e. non bootable, start 5570560 length 0
(i.e. end at 5570560); start 513/0/43 end 0/0/18 name 'empty'
1: indicator 0x0 i.e. non bootable, start 3014656 length 0
(i.e. end at 3014656); start 768/0/29 end 0/0/18 name 'empty'
2: indicator 0x0 i.e. non bootable, start 6750208 length 0
(i.e. end at 6750208); start 0/0/54 end 0/0/18 name 'empty'
3: indicator 0x0 i.e. non bootable, start 4587520 length 2438987776
(i.e. end at 2443575296); start 256/0/60 end 0/0/18 name 'empty'
<<
Probably not a recent enough version.

I/O scheduler performance: not good enough!

Posted Aug 26, 2010 13:33 UTC (Thu) by etienne (subscriber, #25256)

Sorry, I said to compile showmap.c but gave the link to showmbr.c; the correct link is:
http://www.mirrorservice.org/sites/download.sourceforge.n...

I/O scheduler performance: not good enough!

Posted Aug 29, 2010 10:38 UTC (Sun) by renox (subscriber, #23785)

Thanks, I should have seen this, but the long URL was shortened, which hid the issue.
I'll try again.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds