It is worth mentioning here what are now called "cache-oblivious" algorithms -- ways to code that get most of the benefits of torturing your code to match all the cache and cache-line sizes without actually knowing what any of those sizes are. The approach involves traversing data sets in an order that noodles around using a maximally-folded curve, to improve locality regardless of cache boundaries.
The general approach was first applied to matrix operations by Todd Veldhuizen in 1996. Veldhuizen used it in his Free matrix library Blitz++, which he demonstrated running 60% faster than IBM's own Fortran running on an IBM vector machine. Harold Prokop proposed the name that stuck in a paper in 1998. (Oddly, Prokop et al. have refused to acknowledge Veldhuizen's prior work.) Veldhuizen is now a professor at U. Waterloo. The technique is used in the "FFTW" FFT library, and in every modern C++ matrix library, and graduate students everywhere are writing theses on adapting various algorithms to use it.
(Cache-oblivious sorting, by the way, was developed in the '70s, but at the time the "cache" was main memory, of a size similar to the L2 caches today, which spilled to disk or tape.)
As a consequence of its ability to encapsulate these techniques (and many others) in easy-to-use libraries, since about 2000 C++ has been the preferred language for serious numeric computation. In another recent advance, the (GPL'd) VSIPL++ library encapsulates parallel processing (using e.g. MPI), making it easy to write portable, automatically parallelized matrix and image processing programs.
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds