Only about a factor of 5 is due to memory optimization. The vector instruction optimization is simply applying more CPU horsepower.
What I can't figure out is how the transposition speeds anything up. The article points out that it removes 1000 non-sequential accesses per column from the multiplication loop, but I see those same 1000 non-sequential accesses per column added to the transposition loop.
matrix multiply optimization
Posted Oct 27, 2007 21:55 UTC (Sat) by bartoldeman (guest, #4205) [Link]
Because the transposition takes O(N^2) accesses and the multiplication O(N^3). Each access in the transposition is more expensive, but there are N times fewer of them than in the multiplication...
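To make the counting concrete, here is a minimal sketch in C. The names mul1, mul2, tmp and N follow the article being discussed, but the small N, the fill values, the counters and the function names are assumptions made purely for illustration. Each function returns how many strided (column-wise) reads of mul2 it performed.

```c
#include <stddef.h>

#define N 64 /* assumption: kept small here; the article uses 1000 */

double mul1[N][N], mul2[N][N], tmp[N][N];
double res_naive[N][N], res_transposed[N][N];

/* Fill the inputs with arbitrary small integers (exactly representable). */
void fill_inputs(void)
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            mul1[i][j] = (double)((i + 2 * j) % 7);
            mul2[i][j] = (double)((3 * i + j) % 5);
        }
}

/* Original loop: the inner loop walks down a column of mul2, so every
 * one of the N*N*N reads of mul2 is a stride-N (non-sequential) access. */
unsigned long multiply_naive(void)
{
    unsigned long strided = 0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < N; ++k) {
                sum += mul1[i][k] * mul2[k][j]; /* strided read of mul2 */
                ++strided;
            }
            res_naive[i][j] = sum;
        }
    return strided;
}

/* Transposed version: mul2 is read column-wise once (N*N strided reads);
 * afterwards the multiplication reads both operands row-wise. */
unsigned long multiply_transposed(void)
{
    unsigned long strided = 0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            tmp[i][j] = mul2[j][i]; /* strided read of mul2 */
            ++strided;
        }
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < N; ++k)
                sum += mul1[i][k] * tmp[j][k]; /* sequential reads only */
            res_transposed[i][j] = sum;
        }
    return strided;
}

/* Returns 1 if both versions produced the same result matrix. */
int results_match(void)
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            if (res_naive[i][j] != res_transposed[i][j])
                return 0;
    return 1;
}
```

Calling fill_inputs() and then both multiply functions should give counts of N*N*N and N*N respectively, a ratio of N. The counters only tally accesses; they say nothing about actual cache behaviour or timing, which depend on the machine.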
matrix multiply optimization
Posted Oct 27, 2007 22:41 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
Aha. Perfectly clear now. The article neglects to explain this; I'd probably say, "the original traverses mul2 in this expensive nonsequential way 1000 times, whereas the improved version does it only once."