|
|

Memory part 5: What programmers can do

Memory part 5: What programmers can do

Posted Oct 23, 2007 22:32 UTC (Tue) by Coren (guest, #39136)
Parent article: Memory part 5: What programmers can do

10 times faster on a simple plain matrix multiplication. Amazing.
With optimizing only memory access.

I did not realise, before this series of articles, that memory can take a such place in
optimisation.

Thanks a lot, it's so interesting and well written.

Memory part 5: What programmers can do

Posted Oct 27, 2007 14:15 UTC (Sat) by bartoldeman (guest, #4205) [Link]

It is even 20 times faster if you use ATLAS 3.8:
http://math-atlas.sourceforge.net/
Its DGEMM routine does the job in around 708,000,000 cycles, another factor of 2 faster (my
other numbers, also on a Core 2 (a Duo, but single threaded), were very similar to Ulrich's so
I can state this with some confidence). Of course there's been a lot of research and tweaking
to obtain this score.

ATLAS' SUMMARY.LOG reports for DGEMM:
Performance: 4846.05MFLOPS (302.88 percent of of detected clock rate)
and this is on a 1.6GHz Core 2 Duo, not 2.66GHz!

GOTOBLAS may also be worth looking at, for comparison. It looks more into TLB misses than
ATLAS does.

matrix multiply optimization

Posted Oct 27, 2007 21:44 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Only about a factor of 5 is due to memory optimization. The vector instruction optimization is simply applying more CPU horsepower.

What I can't figure out is how the transposition speeds anything up. The article points out that it removes 1000 non-sequential accesses per column from the multiplication loop, but I see that same 1000 non-sequential accesses per column added to the transposition loop.

matrix multiply optimization

Posted Oct 27, 2007 21:55 UTC (Sat) by bartoldeman (guest, #4205) [Link]

Because transposition uses O(N^2) accesses and multiplication O(N^3). The accesses in the
transposition are more expensive but there are N times fewer than in the multiplication...

matrix multiply optimization

Posted Oct 27, 2007 22:41 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Aha. Perfectly clear now. The article neglects to explain this; I'd probably say, "the original traverses mul2 in this expensive nonsequential way 1000 times, whereas the improved version does it only once."