Referring to the optimized matrix multiplication code, the text reads:
> k2 and j2 loops are in a different order. This is done since, in the actual
> computation, only one expression depends on k2 but two depend on j2
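I believe a better reason for swapping the two loops is that mul2 is then traversed by
rows instead of by columns. With j2 as the innermost loop, consecutive iterations read
adjacent elements of mul2 that share a cache line, whereas with k2 innermost every
iteration would jump a whole row ahead. This is the whole point of the example, since
it avoids the extra cache misses that column-wise access to this matrix would cause
(mul2 is only read, so nothing is actually being "dirtied").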
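
To make the row-versus-column point concrete, here is a minimal sketch with the blocking stripped away. The function names matmul_ijk and matmul_ikj and the size N are my own choices for illustration, not taken from the text; only the names res, mul1 and mul2 follow the original code.

```c
#include <stddef.h>

#define N 64 /* hypothetical matrix size, chosen only for this sketch */

/* k innermost: mul2[k][j] is read down a column, so consecutive
   reads are N doubles apart in memory.
   res must be zeroed by the caller, since results are accumulated. */
void matmul_ijk(double res[N][N], double mul1[N][N], double mul2[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < N; ++k)
                res[i][j] += mul1[i][k] * mul2[k][j];
}

/* j innermost: mul2[k][j] and res[i][j] are both read along a row,
   so consecutive reads are adjacent in memory, and mul1[i][k] stays
   fixed for the whole inner loop.
   res must be zeroed by the caller here as well. */
void matmul_ikj(double res[N][N], double mul1[N][N], double mul2[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t k = 0; k < N; ++k)
            for (size_t j = 0; j < N; ++j)
                res[i][j] += mul1[i][k] * mul2[k][j];
}
```

Both functions add the same products to each res[i][j] in the same order of k, so they compute the same result; only the memory access pattern differs. The second form corresponds to the k2-outside, j2-inside order used in the optimized code.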