Odd iozone results
Posted Jan 31, 2012 3:49 UTC (Tue) by sbergman27 (guest, #10767)
I say 60x because vmstat was reporting 700KB to 1000KB per second; the real throughput may well have been worse. It looked like it was going to take hours, so I stopped it after 45 minutes.
The track-to-track seek on this drive is ~0.8ms, and /sys/block/sdX/queue/nr_requests is at the default 128. (Raising it to 200,000 doesn't help.) With 128 queued requests, one would expect a single pass over the platter surface, servicing random requests, to take less than 0.1 seconds and to read/write 1MB, for a combined total of ~10,000KB/s. This assumes all requests land on different cylinders.
Clearly, however, Dave is incorrect that Mixed Workload is a random read/write test. There is no way I would be seeing 60,000KB/s of throughput with a 16k record size on a dataset twice the size of RAM under such a workload. And in the 4k record case I would expect to see far better numbers than I do.
Clearly there is something odd going on. I'm curious what it might be.
Posted Jan 31, 2012 4:56 UTC (Tue) by raven667 (subscriber, #5198)
But I don't see where you get 60x. The second run might be completing 4x more work per I/O, leading to fewer IOPS and fewer expensive seeks needed to complete the test, if the total amount of data being read/written is fixed.
The track-to-track seek is 0.8ms, but that is the minimum, for adjacent tracks; the latency goes up the farther the head has to travel. The average is probably closer to 6ms. Your estimate for a full seek across the disk is also way off: that is probably closer to 15ms or more, and that's if you don't stop to write data 128 times along the way, like stopping at every bathroom on a road trip. That's what we mean when we say the drive can probably only do about 175 of these operations per second, and that this is the limiting factor; the time it takes to actually read or write a track is smaller (but of course not zero).
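A quick back-of-the-envelope check of those figures. Here ~5.7ms is treated as the assumed average per-operation service time, chosen to be consistent with the ~175 operations/second claim above; the 8KB per operation assumes the 4k-read-plus-4k-write pattern discussed in this thread:

```python
# Assumed average per-operation service time (seek-dominated), in ms.
avg_op_ms = 5.7
iops = 1000 / avg_op_ms
print(round(iops))            # ~175 operations per second

# With ~8KB moved per operation (4k read + 4k write), the
# seek-limited throughput is far below the sequential rate:
throughput_kb_s = iops * 8
print(round(throughput_kb_s))  # ~1400 KB/s
```

So a purely seek-bound workload on this class of drive would sit in the low-megabyte-per-second range at best.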
I haven't thought about why you are getting the specific numbers you see (I have mostly used sysbench rather than iozone), but the vmstat numbers seem plausible for a random I/O workload.
Posted Jan 31, 2012 5:43 UTC (Tue) by sbergman27 (guest, #10767)
The elevator algorithm basically takes you from the worst case, a totally random 8ms per-request time, down closer to the 0.8ms track-to-track seek time. That's the whole purpose of the elevator algorithm, and it works no matter how large your dataset is relative to system RAM.
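A toy illustration of why servicing queued requests in sorted order (the essence of an elevator/SCAN scheduler) cuts total head travel; the block addresses and queue depth here are arbitrary made-up values:

```python
import random

random.seed(0)
# 128 random "block addresses", matching the nr_requests queue depth
# discussed above (the address range itself is arbitrary).
requests = [random.randrange(1_000_000) for _ in range(128)]

def total_travel(order, start=0):
    """Total head movement to service requests in the given order."""
    travel, pos = 0, start
    for blk in order:
        travel += abs(blk - pos)
        pos = blk
    return travel

fifo = total_travel(requests)          # arrival order: hops all over the platter
scan = total_travel(sorted(requests))  # elevator order: one sweep
print(fifo, scan)
```

One sorted sweep never travels farther than a single end-to-end pass, while arrival order pays a long seek for nearly every request.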
Each visit to an affected track gets you 4k of read and 4k of write, for a total of 8k per track. 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface, and 128 requests at 8k read/write per request works out to ~10MB/s. Now, I perhaps should have included an extra 1/120th of a second to account for an extra rotation of the platter between the read and the write; it really all depends upon exactly what the random read/write benchmark is doing. If we include that, we get 5MB/s.
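The best-case arithmetic above can be checked directly, assuming every one of the 128 queued requests costs only a 0.8ms track-to-track seek:

```python
n_requests = 128
seek_s = 0.0008                    # assumed track-to-track seek time
pass_time = n_requests * seek_s    # one elevator sweep over the surface
print(pass_time)                   # ~0.102 s, close to the "0.1 seconds" above

kb_per_request = 8                 # 4k read + 4k write per track
throughput_kb_s = n_requests * kb_per_request / pass_time
print(round(throughput_kb_s))      # 10000 KB/s, the ~10MB/s figure
```

Any rotational latency added per request lowers this figure substantially, which is the adjustment being debated in the replies.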
What vmstat is showing is 600KB/s to 1000KB/s of reads, and about 2KB/s of writes. You don't think that's odd? And doesn't it seem odd that increasing the record size by 4x increases the throughput by 60+ times? (Again, I stopped the 4k test after 45 minutes and am basing that, optimistically, on the vmstat numbers, so the reality is uncertain. The 16k-record Mixed Workload test takes 2.5 minutes to complete.)
At any rate, the Mixed Workload test is *not* a random read/write benchmark. Somewhere Dave got the idea that it announces itself as one when it starts, but it does not, and that would not even make sense in light of the *name* of the benchmark. Unfortunately, the iozone man page is of little help; it just says it's a "Mixed Workload".
I get similar results when running Mixed Workload with 4k records on the server rather than the desktop machine.
Are you beginning to appreciate the puzzle now? I've tried to make it as clear as possible. If I have not been clear enough on any particular point, please let me know and I will explain further.
Posted Jan 31, 2012 6:55 UTC (Tue) by jimparis (subscriber, #38647)
You're forgetting about rotational latency. After arriving at a particular track, you still need the platter to rotate to the angle at which your requested data is actually stored. For a 7200 RPM disk, that's an average of 4ms (half a rotation). I don't think you can expect to do better than that.
> 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10MB/s.
With 4.8ms per seek, that same calculation gives ~1600KB/s, which is only about twice what you were seeing, and that still assumed you really do hit the 0.8ms track-to-track seek time. I'd expect the time for a seek over 0.1% of the disk surface to actually be quite a bit higher than the time for a seek between adjacent tracks (which span something like ~0.0001% of the disk surface, assuming 1000 kTPI and 1 inch of usable space).
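Both numbers in this exchange follow from a couple of lines of arithmetic (7200 RPM and the 8KB-per-request pattern assumed, as above):

```python
rpm = 7200
rev_s = 60 / rpm                  # one revolution: ~8.33 ms
half_rev_ms = rev_s / 2 * 1000
print(round(half_rev_ms, 2))      # 4.17 ms average rotational latency

# Adding that to the 0.8ms track-to-track seek gives ~4.8-5ms per request:
per_request_s = 0.0008 + rev_s / 2
throughput_kb_s = 8 / per_request_s   # 8KB moved per request
print(round(throughput_kb_s))         # ~1611 KB/s, matching the ~1600KB/s figure
```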
Posted Jan 31, 2012 7:59 UTC (Tue) by sbergman27 (guest, #10767)
Doing a linear interpolation between 0.8ms to seek over 0.8% of the platter and 8ms to seek over 50% of the platter yields ~0.9ms for the seek. (And then we can add the 4ms for the half rotation.) Whether a linear interpolation is justified here is another matter.
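The interpolation can be written out explicitly. The comment doesn't state the exact seek distance being interpolated for; the ~1.5% of the platter used below is an illustrative assumption that reproduces the ~0.9ms result (and, as noted, linearity of seek time in distance is itself questionable):

```python
def seek_ms(pct):
    """Linear interpolation between the two anchor points from the comment:
    0.8ms to cross 0.8% of the platter, 8ms to cross 50%."""
    x0, y0 = 0.8, 0.8
    x1, y1 = 50.0, 8.0
    return y0 + (y1 - y0) * (pct - x0) / (x1 - x0)

# A short seek of ~1.5% of the platter (assumed distance):
print(round(seek_ms(1.5), 2))   # ~0.9 ms, before adding ~4ms rotational latency
```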
However, let's not forget that there is *no* reason to think that the "Mixed Workload" phase is pure random read/write. The whole random-seek question is a separate side issue with respect to the actual benchmark numbers I get. If the numbers *are* purely a matter of the drive's random access time, why do 4k records get me such incredibly dismal results, while 16k records get me nearly the drive's peak sequential read/write rate? One would expect the random-seek read/write rate to scale linearly with the record size.
BTW, thanks for the post. This has been a very interesting exercise.
Posted Feb 2, 2012 20:52 UTC (Thu) by dgc (subscriber, #6611)
You should run strace on it and see what the test is really doing. That's what I did, and it's definitely doing random writes followed by random reads at the syscall level.
Posted Jan 31, 2012 7:36 UTC (Tue) by raven667 (subscriber, #5198)
Posted Jan 31, 2012 7:40 UTC (Tue) by raven667 (subscriber, #5198)
Posted Jan 31, 2012 8:30 UTC (Tue) by sbergman27 (guest, #10767)
I'm in Chrome now. And it caught "speeling" quite handily.
Rick Moen mentioned something about a "sixty-second window" for LWN.net posts. I've never noticed any sort of window at all. As soon as I hit "Publish comment" my mistakes are frozen for all eternity.
Sometimes it's a comfort to know that no one really cares about my posts. ;-)
Posted Jan 31, 2012 16:28 UTC (Tue) by raven667 (subscriber, #5198)
Posted Jan 31, 2012 8:19 UTC (Tue) by sbergman27 (guest, #10767)
If random I/O were the major factor in Mixed Workload, why does moving from a 4k record size to a 16k one take me from such a dismal data rate to nearly the drive's peak sequential read/write rate? It makes no sense.
That's the significant question. Why does increasing the record size by a factor of 4 result in such a dramatic increase in throughput?
And BTW, I have not yet explicitly congratulated the XFS team on the fact that XFS is now nearly as fast as ext4 on small-system hardware, at least in this particular benchmark. So I will offer my congratulations now.
Posted Jan 31, 2012 10:59 UTC (Tue) by dlang (✭ supporter ✭, #313)
If you do a sequential read test, readahead comes to your rescue; but in a mixed test, especially where the working set exceeds RAM, the application can end up stalling while waiting for a read, introducing additional delays between disk operations.
I don't know exactly what iozone is doing for its mixed test, but this sort of drastic slowdown is not uncommon.
Also, if you are using a journaling filesystem, there are additional delays as the journal is updated: each write that touches the journal turns into at least two writes, potentially with an expensive seek between them.
I would suggest running the same test on ext2 (or ext4 with the journal disabled) and see what you get.
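A rough cost model of that journaling overhead, using assumed numbers from earlier in the thread (~6ms average seek, ~4.17ms half-rotation at 7200 RPM) and the simplest case of exactly two physical writes per logical write:

```python
seek_ms = 6.0          # assumed average seek (from the discussion above)
rot_ms = 4.17          # assumed half-rotation latency at 7200 RPM
per_write_ms = seek_ms + rot_ms

plain = 1000 / per_write_ms            # logical writes/s without a journal
journaled = 1000 / (2 * per_write_ms)  # journal write + in-place write
print(round(plain), round(journaled))  # journaling roughly halves write IOPS
```

This is the pessimistic per-write case; real journals batch commits, so the observed penalty is usually smaller, which is exactly why measuring on ext2 (or ext4 without a journal) is informative.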
Posted Jan 31, 2012 17:55 UTC (Tue) by jimparis (subscriber, #38647)
From what I can tell by reading the source code, it is a mix where half of the threads are doing random reads, and half of the threads are doing random writes.
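A toy Python sketch of the access pattern described, not iozone's actual implementation: half the workers issue random reads and half issue random writes against one shared file. Record size, file size, and thread count are arbitrary illustrative values:

```python
import os
import random
import tempfile
import threading

RECORD = 4096      # assumed record size, matching the 4k case in the thread
RECORDS = 256      # file is RECORD * RECORDS bytes (1MB here, for the sketch)

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * RECORD * RECORDS)

def worker(writer, ops=100):
    rng = random.Random(writer)
    for _ in range(ops):
        offset = rng.randrange(RECORDS) * RECORD
        if writer:
            os.pwrite(fd, os.urandom(RECORD), offset)  # random write
        else:
            os.pread(fd, RECORD, offset)               # random read

# Half the threads write, half read, as the source describes.
threads = [threading.Thread(target=worker, args=(i % 2,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(os.fstat(fd).st_size)   # file size is unchanged: 1048576
os.close(fd)
os.remove(path)
```

Even this toy version forces the head to alternate between scattered read and write targets, which is what makes the pattern so much harder on a disk than a sequential pass.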
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds