Only about a factor of 5 is due to memory optimization. The vector instruction optimization is simply applying more CPU horsepower.
What I can't figure out is how the transposition speeds anything up. The article points out that it removes 1000 non-sequential accesses per column from the multiplication loop, but I see that same 1000 non-sequential accesses per column added to the transposition loop.
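One way to look at the question is to count the column-order accesses on each side. In the plain multiplication loop, a column of the second matrix is walked once for every row of the first matrix, so with 1000x1000 matrices each column is traversed 1000 times (about 10^9 non-sequential reads in total). The transposition loop walks each column exactly once (about 10^6 reads), and after that the inner loop reads both operands sequentially. The sketch below is only an illustration of that counting argument, not the article's code: the function names, the flat row-major layout, and the `strided` counter are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: res = a * b for n x n row-major matrices.
 * Returns how many reads of b were made in column order (stride n). */
unsigned long long matmul_naive(size_t n, const double *a,
                                const double *b, double *res)
{
    unsigned long long strided = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j]; /* b walked down column j */
                ++strided;
            }
            res[i * n + j] = sum;
        }
    }
    return strided; /* n * n * n */
}

/* Hypothetical helper: same product, but b is transposed into tmp first.
 * Returns how many reads of b were made in column order (stride n). */
unsigned long long matmul_transposed(size_t n, const double *a,
                                     const double *b, double *tmp,
                                     double *res)
{
    unsigned long long strided = 0;
    assert(n > 0);
    /* One-time cost: each column of b is walked exactly once. */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            tmp[i * n + j] = b[j * n + i];
            ++strided;
        }
    }
    /* Inner loop now reads both a and tmp sequentially in k. */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * tmp[j * n + k];
            res[i * n + j] = sum;
        }
    }
    return strided; /* n * n */
}
```

Calling both functions with the same inputs gives the same product, while the returned counts differ by a factor of n. The counter only tallies access order; it does not model the cache itself, so it explains where the saving comes from rather than how large the measured speedup will be.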
Copyright © 2018, Eklektix, Inc.