Odd iozone results
Posted Jan 30, 2012 0:15 UTC (Mon) by dgc (subscriber, #6611)
In reply to: Odd iozone results by sbergman27
Parent article: XFS: the filesystem of the future?
> I decided to take Dave's claim of XFS being good for small systems
> seriously.
....
> I ran:
>
> iozone -i0 -i1 -i2 -i4 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g
That is a data IO only workload - it has no metadata overhead at all. You're comparing apples to oranges considering the talk was all about metadata performance....
> I stopped the mixed workload test after 45 minutes.
.....
> What in the world is going on here???
The mixed workload does random 4KB write IO - it even tells you that when it starts. Your workload doesn't fit in memory, so it will run at disk speed. 700KB/s is 175 4k IOs per second, which is about 6ms per IO. That's close to the average seek time of a typical 7200rpm SATA drive. Given that your workload is writing 8GB of data at that speed, it will take around 3 hours to run to completion.
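Back-of-the-envelope, assuming 4KB IOs at the ~700KB/s rate vmstat reported:

    echo "scale=1; 700/4" | bc                  # ~175 IOs per second
    echo "scale=1; 1000/175" | bc               # ~5.7ms per IO
    echo "scale=1; 8*1024*1024/700/3600" | bc   # ~3.3 hours to write 8GB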
IOWs, there is nothing wrong with your system - it's writing data in exactly the way you asked it to do, albeit slowly. This is not the sort of problem that an experienced storage admin would fail to diagnose....
Dave.
Posted Jan 30, 2012 2:48 UTC (Mon) by raven667 (subscriber, #5198)
In my experience the average server admin wouldn't be able to characterize or understand this issue. I've seen a lot of magical thinking from admins and developers when it comes to storage, and sometimes for networking as well, although the diagnostics tend to be better there (wireshark, tcpdump).
Posted Jan 30, 2012 22:07 UTC (Mon) by sbergman27 (guest, #10767)
Firstly, for context, please read my most recent post to Dave.
I/O scheduler issues aside, the idea of the problem being, fundamentally, raw seek time is out the window.
Else how would it perform as well as it does with 16k records rather than 4k records? A 4x improvement might be expected, not 60+ times.
-Steve
Posted Jan 31, 2012 3:49 UTC (Tue) by sbergman27 (guest, #10767)
http://lwn.net/Articles/477865/
I say 60x because vmstat was reporting 700k to 1000k per second. Real throughput may or may not have been far worse. It looked like it was going to take hours so I stopped it after 45 minutes.
The track-to-track seek on this drive is ~0.8ms. And /sys/block/sdX/queue/nr_requests is at the default 128. (Changing it to 200,000 doesn't help.) At 128, one would expect one pass over the drive platter surface, read/writing random requests, to take less than 0.1 seconds and read/write 1MB, for a read/write total of ~10,000KB/s. This assumes all requests are on different cylinders.
Clearly, however, Dave is incorrect about Mixed Workload being a random read/write test. There is no way I would be seeing 60,000KB/s throughput with a 16k record size on a dataset 2x the size of ram on such a workload. And in the 4k record case I would expect to see a lot better than I do.
Clearly there is something odd going on. I'm curious what it might be.
-Steve
Posted Jan 31, 2012 4:56 UTC (Tue) by raven667 (subscriber, #5198)
https://lwn.net/Articles/478014/
But I don't see where you get 60x. The second run might be completing 4x more work per IO, leading to fewer IOPS and fewer expensive seeks needed to complete the test, if the total data size it is re/writing is fixed.
The track-to-track seek is 0.8ms, but that is the minimum, for adjacent tracks; the latency goes up the farther the seek distance. The average is probably closer to 6ms. Your estimate of a full seek across the disk is way off; that is probably closer to 15ms or more, and that's if you don't stop and write data 128 times along the way, like stopping at every bathroom on a road trip. That's what we mean when we say that the drive can probably only do that about 175 times a second, which is the limiting factor; the time it actually takes to read/write a track is less (but of course not zero).
I haven't thought about why you are getting the specific numbers you see (I have mostly used sysbench, not iozone), but the vmstat numbers seem fine for a random IO workload.
Posted Jan 31, 2012 5:43 UTC (Tue) by sbergman27 (guest, #10767)
The elevator algorithm basically takes you from the worst case, totally random 8ms per-request time, down closer to the 0.8ms track-to-track seek time. That's the whole purpose of the elevator algorithm, and it works no matter how large your dataset is in relation to system ram.
Each read/write total on the affected tracks is going to get you 4k of read and 4k of write, for a total of 8k per track. 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10MB/s. Now, I probably should have included an extra 1/120th second to account for an extra rotation of the platter between read and write. It really all depends upon exactly what the random r/w benchmark is doing. If we do that now, we get 5MB/s.
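Spelled out, taking the 0.8ms figure at face value, that estimate is:

    echo "scale=2; 128*0.8/1000" | bc   # 0.10s for one 128-seek pass
    echo "scale=0; 128*8/0.1" | bc      # ~10240 KB/s at 8k r/w per request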
What vmstat is showing is 600k/s - 1000k/s read, and about 2k/s write. You don't think that's odd? And doesn't it seem odd that increasing the record size by 4x increases the throughput by 60+ times? (Again, I stopped the 4k test after 45 minutes and am basing that, optimistically, upon the vmstat numbers, so the reality is uncertain. The 16k record Mixed Workload test takes 2.5 minutes to complete.)
At any rate, the Mixed Workload test is *not* a random read/write benchmark. Somewhere Dave got the idea that it announces this when it starts, but it does not. And that doesn't even make sense in light of the *name* of the benchmark. Unfortunately, the man page for iozone is of little help. It just says it's a "Mixed Workload".
I get similar results when running Mixed Workload with 4k records on the server rather than the desktop machine.
Are you beginning to appreciate the puzzle now? I've tried to make it as clear as possible. If I have not been clear enough on any particular point, please let me know and I will explain further.
-Steve
Posted Jan 31, 2012 6:55 UTC (Tue) by jimparis (guest, #38647)
You're forgetting about rotational latency. After arriving at a particular track, you still need the platter to rotate to the angle at which your requested data is actually stored. For a 7200 RPM disk, that's an average of 4ms (half a rotation). I don't think you can expect to do better than that.
> 128 seeks at, 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10mb/s.
With 4.8ms per seek, that same calculation gives ~1600KB/s, which is only twice what you were seeing, and that assumes you're really hitting the 0.8ms track seek time. I'd really expect that the time for a seek over 0.1% of the disk surface is actually quite a bit higher than the time for a seek between adjacent tracks (which is something like ~0.0001% of the disk surface, assuming 1000 KTPI and 1 inch of usable space).
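Redoing that pass estimate with 4.8ms per request instead of 0.8ms:

    echo "scale=0; 8*1000/4.8" | bc   # ~1666 KB/s (8k r/w per request, ~208 requests/s)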
Posted Jan 31, 2012 7:59 UTC (Tue) by sbergman27 (guest, #10767)
Doing a linear interpolation between 0.8ms to seek over 0.8% of the platter, and 8ms to seek over 50% of the platter, yields ~0.9ms for the seek. (And then we can add the 4ms for 1/2 rotation.) Whether a linear interpolation is justified here is another matter.
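For what it's worth, the straight line through those two points, evaluated at a ~1% stroke, lands in the same ballpark:

    echo "scale=3; 0.8 + (8-0.8)*(1-0.8)/(50-0.8)" | bc   # ~0.83ms at a 1% stroke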
However, let's not forget that there is *no* reason to think that the "Mixed Workload" phase is pure random read/write. The whole random seek thing is a separate side-question WRT the actual benchmark numbers I get. If the numbers *are* purely random access time for the drive, why do 4k records get me such incredibly dismal results, and 16k records get me near the peak sequential read/write rate of the drive? One would expect the random seek read/write rate to scale linearly with the record size.
BTW, thanks for the post. This has been a very interesting exercise.
-Steve
Posted Feb 2, 2012 20:52 UTC (Thu) by dgc (subscriber, #6611)
> "Mixed Workload" phase is pure random read/write
You should run strace on it and see what the test is really doing. That's what I did, and it's definitely doing random writes followed by random reads at the syscall level....
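Something along these lines (adjust the syscall list for your platform):

    strace -f -tt -e trace=read,write,pread64,pwrite64,lseek -o iozone.strace \
        iozone -i0 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g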
Dave.
Posted Jan 31, 2012 8:30 UTC (Tue) by sbergman27 (guest, #10767)
I'm in Chrome now. And it caught "speeling" quite handily.
Rick Moen mentioned something about a "sixty second window" for LWN.net posts. I've never noticed any sort of window at all. As soon as I hit "Publish comment" my mistakes are frozen for all eternity.
Sometimes it's a comfort to know that no one really cares about my posts. ;-)
-Steve
Posted Jan 31, 2012 8:19 UTC (Tue) by sbergman27 (guest, #10767)
If random I/O were the major factor in Mixed Workload, why does moving from a 4k to a 16k record size take me from such a dismal data rate to nearly the peak sequential read/write data rate for the drive? It makes no sense.
That's the significant question. Why does increasing the record size by a factor of 4 result in such a dramatic increase in throughput?
And BTW, I have not yet explicitly congratulated the XFS team on the fact that XFS is now nearly as fast as EXT4 on small-system hardware, at least in this particular benchmark. So I will offer my congratulations now.
-Steve
Posted Jan 31, 2012 10:59 UTC (Tue) by dlang (guest, #313)
If you do a sequential read test, readahead comes to your rescue, but if you are doing a mixed test, especially where the working set exceeds ram, you can end up with the application stalling waiting for a read and thus introducing additional delays between disk actions.
I don't know exactly what iozone is doing for its mixed test, but this sort of drastic slowdown is not uncommon.
Also, if you are using a journaling filesystem, there are additional delays as the journal is updated (each write that touches the journal turns into at least two writes, potentially with an expensive seek between them).
I would suggest running the same test on ext2 (or ext4 with the journal disabled) and see what you get.
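For example (assuming /dev/sdX1 is a scratch partition you can reformat; this destroys its contents):

    mkfs.ext4 -O ^has_journal /dev/sdX1    # ext4 without a journal
    # or drop the journal from an existing (unmounted) ext4 filesystem:
    tune2fs -O ^has_journal /dev/sdX1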
Posted Jan 31, 2012 17:55 UTC (Tue) by jimparis (guest, #38647)
From what I can tell by reading the source code, it is a mix where half of the threads are doing random reads, and half of the threads are doing random writes.
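A rough fio equivalent of that mix, as I read it (my sketch, not iozone's actual code):

    fio --name=rd --rw=randread  --bs=4k --size=2g --numjobs=2 \
        --name=wr --rw=randwrite --bs=4k --size=2g --numjobs=2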
Posted Jan 30, 2012 6:08 UTC (Mon) by sbergman27 (guest, #10767)
Thank you for the reply.
No. There is something else going on, here. If I add "-r 16384" to use 16k records rather than 4k records, the mixed workload test flies:
XFS:
59985 KB/s Write
60182 KB/s Rewrite
58812 KB/s Mixed Workload
Ext4:
69673 KB/s Write
60372 KB/s Rewrite
60678 KB/s Mixed Workload
The 16k record case is at least 60x faster than the 4k record case. Probably more. I find this to be very odd.
You had already covered the metadata case, and I didn't want to replicate that. So the fact that this is not metadata intensive was intentional. This is closer to XFS's traditional stomping grounds, but with small-system hardware rather than large RAID arrays.
"""
Check again. It doesn't say that at all. Some confusion may have been caused by my cutting and pasting the wrong command line into my post. You can safely ignore the "-i 4" which I believe *does* do random read/writes.
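In other words, the 16k-record run, minus the "-i 4", would have been:

    iozone -i0 -i1 -i2 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g -r 16384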
I should also mention that what prompted me to try 16k records was reviewing my aforementioned server benchmark, where it turns out I had specified 16k records.
Rerunning the benchmark on the 8GB RAM server with a 3-drive RAID1 and a 16GB dataset, I see similar "seek death" behavior.
-Steve