Odd iozone results
Posted Jan 30, 2012 0:15 UTC (Mon) by dgc (subscriber, #6611)
In reply to: Odd iozone results by sbergman27
Parent article: XFS: the filesystem of the future?
> I decided to take Dave's claim of XFS being good for small systems
> seriously.
....
> I ran:
>
> iozone -i0 -i1 -i2 -i4 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g
That is a data IO only workload - it has no metadata overhead at all. You're comparing apples to oranges considering the talk was all about metadata performance....
> I stopped the mixed workload test after 45 minutes.
.....
> What in the world is going on here???
The mixed workload does random 4KB write IO - it even tells you that when it starts. Your workload doesn't fit in memory, so it will run at disk speed. 700KB/s is 175 4k IOs per second, which is about 6ms per IO. That's close to the average seek time of a typical 7200rpm SATA drive. Given that your workload is writing 8GB of data at that speed, it will take around 3 hours to run to completion.
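Back-of-the-envelope, assuming 4KB IOs at the ~700KB/s rate vmstat reported:

    echo "scale=1; 700/4" | bc                  # ~175 IOs per second
    echo "scale=1; 1000/175" | bc               # ~5.7ms per IO
    echo "scale=1; 8*1024*1024/700/3600" | bc   # ~3.3 hours to write 8GB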
IOWs, there is nothing wrong with your system - it's writing data in exactly the way you asked it to do, albeit slowly. This is not the sort of problem that an experienced storage admin would fail to diagnose....
Dave.
Posted Jan 30, 2012 2:48 UTC (Mon) by raven667 (subscriber, #5198)
In my experience the average server admin wouldn't be able to characterize or understand this issue. I've seen a lot of magical thinking from admins and developers when it comes to storage, and sometimes for networking as well, although the diagnostics tend to be better there (wireshark, tcpdump).
Posted Jan 30, 2012 22:07 UTC (Mon) by sbergman27 (guest, #10767)
Firstly, for context, please read my most recent post to Dave.
I/O scheduler issues aside, the idea of the problem being, fundamentally, raw seek time is out the window.
Else how would it perform as well as it does with 16k records rather than 4k records? A 4x improvement might be expected, not 60+ times.
-Steve
Posted Jan 31, 2012 3:49 UTC (Tue) by sbergman27 (guest, #10767)
http://lwn.net/Articles/477865/
I say 60x because vmstat was reporting 700k to 1000k per second. Real throughput may or may not have been far worse. It looked like it was going to take hours so I stopped it after 45 minutes.
The track-to-track seek on this drive is ~0.8ms. And /sys/block/sdX/queue/nr_requests is at the default 128. (Changing it to 200,000 doesn't help.) At 128, one would expect one pass over the drive platter surface, read/writing random requests, to take less than 0.1 seconds and read/write 1MB, for a read/write total of ~10,000KB/s. This assumes all requests are on different cylinders.
Clearly, however, Dave is incorrect about Mixed Workload being a random read/write test. There is no way I would be seeing 60,000KB/s throughput with a 16k record size on a dataset 2x the size of ram on such a workload. And in the 4k record case I would expect to see a lot better than I do.
Clearly there is something odd going on. I'm curious what it might be.
-Steve
Posted Jan 31, 2012 4:56 UTC (Tue) by raven667 (subscriber, #5198)
https://lwn.net/Articles/478014/
But I don't see where you get 60x. The second run might be completing 4x more work per IO, leading to fewer IOPS and fewer expensive seeks needed to complete the test, if the total data size it is re/writing is fixed.
The track-to-track seek is 0.8ms, but that is the minimum, for adjacent tracks; the latency goes up the farther the seek distance. The average is probably closer to 6ms. Your estimate of a full seek across the disk is way off; that is probably closer to 15ms or more, and that's if you don't stop and write data 128 times along the way, like stopping at every bathroom on a road trip. That's what we mean when we say that the drive can probably only do that about 175 times a second, which is the limiting factor; the time it actually takes to read/write a track is less (but of course not zero).
I haven't thought about why you are getting the specific numbers you see (I have mostly used sysbench, not iozone), but the vmstat numbers seem fine for a random IO workload.
Posted Jan 31, 2012 5:43 UTC (Tue) by sbergman27 (guest, #10767)
The elevator algorithm basically takes you from the worst case, totally random 8ms per-request time, down closer to the 0.8ms track-to-track seek time. That's the whole purpose of the elevator algorithm, and it works no matter how large your dataset is in relation to system ram.
Each read/write total on the affected tracks is going to get you 4k of read and 4k of write, for a total of 8k per track. 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10MB/s. Now, I probably should have included an extra 1/120th second to account for an extra rotation of the platter between read and write. It really all depends upon exactly what the random r/w benchmark is doing. If we do that now, we get 5MB/s.
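Spelled out, taking the 0.8ms figure at face value, that estimate is:

    echo "scale=2; 128*0.8/1000" | bc   # 0.10s for one 128-seek pass
    echo "scale=0; 128*8/0.1" | bc      # ~10240 KB/s at 8k r/w per request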
What vmstat is showing is 600k/s - 1000k/s read, and about 2k/s write. You don't think that's odd? And doesn't it seem odd that increasing the record size by 4x increases the throughput by 60+ times? (Again, I stopped the 4k test after 45 minutes and am basing that, optimistically, upon the vmstat numbers, so the reality is uncertain. The 16k record Mixed Workload test takes 2.5 minutes to complete.)
At any rate, the Mixed Workload test is *not* a random read/write benchmark. Somewhere Dave got the idea that it announces this when it starts, but it does not. And that doesn't even make sense in light of the *name* of the benchmark. Unfortunately, the man page for iozone is of little help. It just says it's a "Mixed Workload".
I get similar results when running Mixed Workload with 4k records on the server rather than the desktop machine.
Are you beginning to appreciate the puzzle now? I've tried to make it as clear as possible. If I have not been clear enough on any particular point, please let me know and I will explain further.
-Steve
Posted Jan 31, 2012 6:55 UTC (Tue) by jimparis (guest, #38647)
You're forgetting about rotational latency. After arriving at a particular track, you still need the platter to rotate to the angle at which your requested data is actually stored. For a 7200 RPM disk, that's an average of 4ms (half a rotation). I don't think you can expect to do better than that.
> 128 seeks at, 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10mb/s.
With 4.8ms per seek, that same calculation gives ~1600KB/s, which is only twice what you were seeing, and that assumes you're really hitting the 0.8ms track seek time. I'd really expect that the time for a seek over 0.1% of the disk surface is actually quite a bit higher than the time for a seek between adjacent tracks (which is something like ~0.0001% of the disk surface, assuming 1000 KTPI and 1 inch of usable space).
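Redoing that pass estimate with 4.8ms per request instead of 0.8ms:

    echo "scale=0; 8*1000/4.8" | bc   # ~1666 KB/s (8k r/w per request, ~208 requests/s)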
Posted Jan 31, 2012 7:59 UTC (Tue) by sbergman27 (guest, #10767)
Doing a linear interpolation between 0.8ms to seek over 0.8% of the platter, and 8ms to seek over 50% of the platter, yields ~0.9ms for the seek. (And then we can add the 4ms for 1/2 rotation.) Whether a linear interpolation is justified here is another matter.
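For what it's worth, the straight line through those two points, evaluated at a ~1% stroke, lands in the same ballpark:

    echo "scale=3; 0.8 + (8-0.8)*(1-0.8)/(50-0.8)" | bc   # ~0.83ms at a 1% stroke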
However, let's not forget that there is *no* reason to think that the "Mixed Workload" phase is pure random read/write. The whole random seek thing is a separate side-question WRT the actual benchmark numbers I get. If the numbers *are* purely random access time for the drive, why do 4k records get me such incredibly dismal results, and 16k records get me near the peak sequential read/write rate of the drive? One would expect the random seek read/write rate to scale linearly with the record size.
BTW, thanks for the post. This has been a very interesting exercise.
-Steve
Posted Feb 2, 2012 20:52 UTC (Thu) by dgc (subscriber, #6611)
> "Mixed Workload" phase is pure random read/write
You should run strace on it and see what the test is really doing. That's what I did, and it's definitely doing random writes followed by random reads at the syscall level....
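Something along these lines (adjust the syscall list for your platform):

    strace -f -tt -e trace=read,write,pread64,pwrite64,lseek -o iozone.strace \
        iozone -i0 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g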
Dave.
Posted Jan 31, 2012 8:30 UTC (Tue) by sbergman27 (guest, #10767)
I'm in Chrome now. And it caught "speeling" quite handily.
Rick Moen mentioned something about a "sixty second window" for LWN.net posts. I've never noticed any sort of window at all. As soon as I hit "Publish comment" my mistakes are frozen for all eternity.
Sometimes it's a comfort to know that no one really cares about my posts. ;-)
-Steve
Posted Jan 31, 2012 8:19 UTC (Tue) by sbergman27 (guest, #10767)
If random I/O were the major factor in Mixed Workload, why does moving from a 4k to a 16k record size take me from such a dismal data rate to nearly the peak sequential read/write data rate for the drive? It makes no sense.
That's the significant question. Why does increasing the record size by a factor of 4 result in such a dramatic increase in throughput?
And BTW, I have not yet explicitly congratulated the XFS team on the fact that XFS is now nearly as fast as EXT4 on small-system hardware, at least in this particular benchmark. So I will offer my congratulations now.
-Steve
Posted Jan 31, 2012 10:59 UTC (Tue) by dlang (guest, #313)
If you do a sequential read test, readahead comes to your rescue, but if you are doing a mixed test, especially where the working set exceeds ram, you can end up with the application stalling waiting for a read and thus introducing additional delays between disk actions.
I don't know exactly what iozone is doing for its mixed test, but this sort of drastic slowdown is not uncommon.
Also, if you are using a journaling filesystem, there are additional delays as the journal is updated (each write that touches the journal turns into at least two writes, potentially with an expensive seek between them).
I would suggest running the same test on ext2 (or ext4 with the journal disabled) and see what you get.
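For example (assuming /dev/sdX1 is a scratch partition you can reformat; this destroys its contents):

    mkfs.ext4 -O ^has_journal /dev/sdX1    # ext4 without a journal
    # or drop the journal from an existing (unmounted) ext4 filesystem:
    tune2fs -O ^has_journal /dev/sdX1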
Posted Jan 31, 2012 17:55 UTC (Tue) by jimparis (guest, #38647)
From what I can tell by reading the source code, it is a mix where half of the threads are doing random reads, and half of the threads are doing random writes.
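A rough fio equivalent of that mix, as I read it (my sketch, not iozone's actual code):

    fio --name=rd --rw=randread  --bs=4k --size=2g --numjobs=2 \
        --name=wr --rw=randwrite --bs=4k --size=2g --numjobs=2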
Posted Jan 30, 2012 6:08 UTC (Mon) by sbergman27 (guest, #10767)
Thank you for the reply.
No. There is something else going on, here. If I add "-r 16384" to use 16k records rather than 4k records, the mixed workload test flies:
XFS:
59985 KB/s Write
60182 KB/s Rewrite
58812 KB/s Mixed Workload
Ext4:
69673 KB/s Write
60372 KB/s Rewrite
60678 KB/s Mixed Workload
The 16k record case is at least 60x faster than the 4k record case. Probably more. I find this to be very odd.
You had already covered the metadata case, and I didn't want to replicate that. So the fact that this is not metadata intensive was intentional. This is closer to XFS's traditional stomping grounds, but with small-system hardware rather than large RAID arrays.
"""
Check again. It doesn't say that at all. Some confusion may have been caused by my cutting and pasting the wrong command line into my post. You can safely ignore the "-i 4" which I believe *does* do random read/writes.
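In other words, the 16k-record run, minus the "-i 4", would have been:

    iozone -i0 -i1 -i2 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g -r 16384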
I should also mention that what prompted me to try 16k records was reviewing my aforementioned server benchmark, where it turns out I had specified 16k records.
Rerunning the benchmark on the 8GB RAM server with a 3-drive RAID1 and a 16GB dataset, I see similar "seek death" behavior.
-Steve