Odd iozone results

Posted Jan 31, 2012 5:43 UTC (Tue) by sbergman27 (guest, #10767)
In reply to: Odd iozone results by raven667
Parent article: XFS: the filesystem of the future?

Look at it this way. You have 128 requests in a sorted queue. On average, each seek is going to be over about 1/128th of the platter surface. About 0.8%. Completely random seeks are going to average a distance of ~50% of the platter surface. The completely random seek time for this drive is about 8ms. The track to track is ~0.8ms according to the manufacturer's specs. Seeking over 0.8% of the platter surface is going to take a lot closer to the track to track seek time than to the completely random seek time.

The elevator algorithm basically takes you from the worst case, totally random 8ms per request time, down closer to the 0.8ms track to track seek time. That's the whole purpose of the elevator algorithm, and it works no matter how large your dataset is in relation to system ram.
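A rough way to check the sorted-queue argument is to simulate it. The sketch below assumes request positions are spread uniformly over the stroke and measures head travel as a fraction of the full stroke; the names and the fixed seed are illustrative only, and it models distance, not time. Under that uniform assumption the arrival-order average tends to come out nearer a third of the stroke than a half, but the gap to the sorted sweep is large either way.

```python
import random


def mean_seek_distance(positions):
    """Mean head travel between consecutive requests, as a fraction of the stroke."""
    gaps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)


rng = random.Random(2012)  # fixed seed so the run is repeatable
queue = [rng.random() for _ in range(128)]  # 128 request positions in 0.0..1.0

arrival_order = mean_seek_distance(queue)          # serve requests as they arrived
elevator_order = mean_seek_distance(sorted(queue))  # serve them in one sorted sweep

print(f"arrival order: {arrival_order:.3f} of the stroke per seek")
print(f"sorted sweep:  {elevator_order:.4f} of the stroke per seek")
```

The sorted sweep can never average more than 1/127 of the stroke per seek here, because 127 gaps have to fit inside one pass over the surface.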

Each request on the affected tracks is going to get you 4k of read and 4k of write, for a total of 8k per track. 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10MB/s. Now, I probably should have included an extra 1/120th of a second to account for an extra rotation of the platter between read and write. It really all depends upon exactly what the random r/w benchmark is doing. If we do that now, we get 5MB/s.

What vmstat is showing is 600k/s - 1000k/s read, and about 2k/s write. You don't think that's odd? And doesn't it seem odd that increasing the record size by 4x increases the throughput by 60+ times? (Again, I stopped the 4k test after 45 minutes and am basing that, optimistically, upon the vmstat numbers, so the reality is uncertain. The 16k record Mixed Workload test takes 2.5 minutes to complete.)

At any rate, the Mixed Workload test is *not* a random read/write benchmark. Somewhere Dave got the idea that it said it was when it started. But it does not. And that does not even make sense in light of the *name* of the benchmark. Unfortunately, the man page for iozone is of little help. It just says it's a "Mixed Workload".

I get similar results when running Mixed Workload with 4k records on the server rather than the desktop machine.

Are you beginning to appreciate the puzzle now? I've tried to make it as clear as possible. If I have not been clear enough on any particular point, please let me know and I will explain further.

-Steve



Odd iozone results

Posted Jan 31, 2012 6:55 UTC (Tue) by jimparis (subscriber, #38647) [Link]

> The elevator algorithm basically takes you from the worst case, totally random 8ms per request time, down closer to the 0.8ms track to track seek time.

You're forgetting about rotational latency. After arriving at a particular track, you still need the platter to rotate to the angle at which your requested data is actually stored. For a 7200 RPM disk, that's an average of 4ms (half a rotation). I don't think you can expect to do better than that.

> 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10MB/s.

With 4.8ms per seek, that same calculation gives ~1600KB/s. That is only twice what you were seeing, and it assumes that you're really hitting that 0.8ms track seek time. I'd really expect that the time for a seek over 0.8% of the disk surface is actually quite a bit higher than the time for a seek between adjacent tracks (which is something like ~0.0001% of the disk surface, assuming 1000 KTPI and 1 inch usable space).
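The arithmetic in this exchange is easy to replay. This is only a sketch under the thread's own assumptions (8k moved per request, a 0.8ms seek, a 7200 RPM drive); the function names are made up for illustration.

```python
def half_rotation_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def throughput_kb_per_s(per_request_ms, kb_per_request=8):
    """Data rate when every request costs per_request_ms and moves kb_per_request."""
    return kb_per_request / (per_request_ms / 1000)


seek_ms = 0.8  # track-to-track figure quoted above
rotation_ms = half_rotation_ms(7200)

print(f"half rotation at 7200 RPM: {rotation_ms:.2f} ms")
print(f"seek only:            {throughput_kb_per_s(seek_ms):.0f} KB/s")
print(f"seek + 4 ms rotation: {throughput_kb_per_s(seek_ms + 4.0):.0f} KB/s")
```

The seek-only case reproduces the ~10MB/s figure, and adding the rounded 4ms of rotational latency gives roughly 1670KB/s, in line with the ~1600KB/s estimate.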

Odd iozone results

Posted Jan 31, 2012 7:59 UTC (Tue) by sbergman27 (guest, #10767) [Link]

That's true enough. I had not actually forgotten about it, but when I did the calculation I was off by a factor of 10. (Still getting used to the fact that my new HP50g sometimes drops keystrokes if I enter too rapidly.) It didn't seem worth messing with.

Doing a linear interpolation between 0.8ms for a track-to-track seek and 8ms for a seek over 50% of the platter yields ~0.9ms for a seek over 0.8% of the platter. (And then we can add the 4ms for 1/2 rotation.) Whether using a linear interpolation is justified here is another matter.
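For reference, the interpolation can be written out as below. It assumes the 0.8ms track-to-track figure sits at (approximately) zero distance and the 8ms figure at 50% of the platter; the function name and defaults are illustrative, and real seek curves are not linear.

```python
def interpolated_seek_ms(fraction, short_ms=0.8, full_ms=8.0, full_fraction=0.5):
    """Straight line from (0, short_ms) to (full_fraction, full_ms)."""
    return short_ms + (full_ms - short_ms) * fraction / full_fraction


seek_ms = interpolated_seek_ms(1 / 128)  # 1/128 of the platter, about 0.8%
print(f"seek over 1/128 of the platter: {seek_ms:.2f} ms")
print(f"plus half a rotation (4 ms):    {seek_ms + 4.0:.2f} ms")
```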

However, let's not forget that there is *no* reason to think that the "Mixed Workload" phase is pure random read/write. The whole random seek thing is a separate side-question WRT the actual benchmark numbers I get. If the numbers *are* purely random access time for the drive, why do 4k records get me such incredibly dismal results, and 16k records get me near the peak sequential read/write rate of the drive? One would expect the random seek read/write rate to scale linearly with the record size.

BTW, thanks for the post. This has been a very interesting exercise.

-Steve

Odd iozone results

Posted Feb 2, 2012 20:52 UTC (Thu) by dgc (subscriber, #6611) [Link]

> However, let's not forget that there is *no* reason to think that the
> "Mixed Workload" phase is pure random read/write

You should run strace on it and see what the test is really doing. That's what I did, and it's definitely doing random writes followed by random reads at the syscall level....

Dave.

Odd iozone results

Posted Jan 31, 2012 7:36 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Maybe I missed it, but I think that's the first time I have seen mention of the runtime for the 16kb workload; 2.5m is smaller than the 3h or whatever the estimate was for the previous test. That would be interesting to characterize, but it might also be an error in testing. I notice that Jim has responded and pointed out that some of your estimates of seek time are orders of magnitude high, which is causing some of the misunderstanding.

Odd iozone results

Posted Jan 31, 2012 7:40 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Gah! Spelling errors from writing late on a touchscreen, don't judge too harshly 8-)

Odd iozone results

Posted Jan 31, 2012 8:30 UTC (Tue) by sbergman27 (guest, #10767) [Link]

Let he who has never made a spelling blunder cast the first stone. Earlier, I was posting from the default Firefox in SL 6.1. Expecting it to catch my speeling errors inline. I made several posts before realizing that the damned thing was completely nonfunctional, and was too mortified to go back and look.

I'm in Chrome now. And it caught "speeling" quite handily.

RickMoen mentioned something about a "sixty second window" for LWN.net posts. I've never noticed any sort of window at all. As soon as I hit "Publish comment" my mistakes are frozen for all eternity.

Sometimes it's a comfort to know that no one really cares about my posts. ;-)

-Steve

Odd iozone results

Posted Jan 31, 2012 16:28 UTC (Tue) by raven667 (subscriber, #5198) [Link]

I think that was a different thing. IIUC, Rick just posted something incorrect, and someone called him on it 60s before he replied to his own post with the correction; he seemed a little embarrassed and miffed.

Odd iozone results

Posted Jan 31, 2012 8:19 UTC (Tue) by sbergman27 (guest, #10767) [Link]

We're probably focusing a bit too much on random read/write times when there is no evidence or rationale for thinking that the Mixed Workload phase is random read/write. Dave threw that in for reasons which are unclear. There is no evidence from the iozone output, or from its man page, to support the assertion.

If random i/o were the major factor in Mixed Workload, why does moving from a 4k record size to a 16k record size take me from such a dismal data rate to nearly the peak sequential read/write data rate for the drive? It makes no sense.

That's the significant question. Why does increasing the record size by a factor of 4 result in such a dramatic increase in throughput?

And BTW, I have not yet explicitly congratulated the XFS team on the fact that XFS is now nearly as fast as EXT4 on small system hardware. At least in this particular benchmark. So I will offer my congratulations now.

-Steve

Odd iozone results

Posted Jan 31, 2012 10:59 UTC (Tue) by dlang (subscriber, #313) [Link]

One thing to remember: in the general case writes can be cached, but the application pauses for reads.

If you do a sequential read test, readahead comes to your rescue, but if you are doing a mixed test, especially where the working set exceeds ram, you can end up with the application stalling waiting for a read and thus introducing additional delays between disk actions.

I don't know exactly what iozone is doing for its mixed test, but this sort of drastic slowdown is not uncommon.

Also, if you are using a journaling filesystem, there are additional delays as the journal is updated (each write that touches the journal turns into at least two writes, potentially with an expensive seek between them).

I would suggest running the same test on ext2 (or ext4 with the journal disabled) and see what you get.

Odd iozone results

Posted Jan 31, 2012 17:55 UTC (Tue) by jimparis (subscriber, #38647) [Link]

> We're probably focusing a bit too much on random read/write times when there is no evidence or rationale for thinking that the Mixed Workload phase is random read/write. Dave threw that in for reasons which are unclear. There is no evidence from the iozone output, or from its man page, to support the assertion.

From what I can tell by reading the source code, it is a mix where half of the threads are doing random reads, and half of the threads are doing random writes.
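To make that access pattern concrete, here is a toy sketch of a "half the threads read, half the threads write" random workload. It is not iozone's code and not a benchmark: the record size, file size, thread count, and operation count are arbitrary assumptions, the file is small enough to live entirely in the page cache, and `os.pread`/`os.pwrite` are only available on POSIX systems.

```python
import os
import random
import tempfile
import threading

RECORD = 4096         # 4k records, as in the slow run (assumed)
FILE_RECORDS = 256    # tiny file so the sketch finishes instantly
OPS_PER_THREAD = 100  # arbitrary


def run_mixed(threads=4):
    """Run readers and writers at random offsets; return operation counts."""
    counts = {"read": 0, "write": 0}
    lock = threading.Lock()

    def worker(fd, do_write, seed):
        rng = random.Random(seed)
        for _ in range(OPS_PER_THREAD):
            offset = rng.randrange(FILE_RECORDS) * RECORD  # random record
            if do_write:
                os.pwrite(fd, b"w" * RECORD, offset)
            else:
                os.pread(fd, RECORD, offset)
        with lock:
            counts["write" if do_write else "read"] += OPS_PER_THREAD

    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * RECORD * FILE_RECORDS)  # pre-size the file
        f.flush()
        fd = f.fileno()
        # Even-numbered threads read, odd-numbered threads write.
        pool = [threading.Thread(target=worker, args=(fd, i % 2 == 1, i))
                for i in range(threads)]
        for t in pool:
            t.start()
        for t in pool:
            t.join()
    return counts


print(run_mixed())
```

To see disk behaviour rather than cache behaviour, the file would have to be much larger than RAM (or opened with direct I/O), which is the situation being discussed in this thread.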


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds