> Once the working set no longer fits in L3 cache, the Haskell implementation is even neck-and-neck with the implementation of ddotp from GotoBLAS, a collection of highly tuned BLAS routines hand-written in assembly language that is generally considered to be one of the fastest BLAS implementations available.