LWN.net Logo

Writing kernel modules in Haskell

Writing kernel modules in Haskell

Posted Sep 14, 2009 9:22 UTC (Mon) by sagrailo (guest, #48054)
In reply to: Writing kernel modules in Haskell by bjacob
Parent article: Writing kernel modules in Haskell

An example is if eventually a kernel module needs to do matrix computations. C++ or Fortran matrix libraries are much more powerful than C ones (Fortran because of native array arithmetic; C++ because templates allow to do that and much more)...

If you need vector/matrix computation, you need to use BLAS/LAPACK/etc., period; no one should try to re-implement that kind of code on his own. As for general numerical calculations, I'd agree with your statement that so far Fortran compilers were able to generate much better code than C compilers, but I certainly disagree that C++ template numerical libraries are better than C code - I worked with number of them, and these are all crap, regarding performance, readability etc.; this stuff is good only if you're into writing some papers on how smart and wonderful your usage of templates is.


(Log in to post comments)

Writing kernel modules in Haskell

Posted Sep 14, 2009 21:23 UTC (Mon) by gmaxwell (subscriber, #30048) [Link]

"If you need vector/matrix computation, you need to use BLAS/LAPACK/etc., period;"

Any kind? Er, so the kernel should be including megabytes of unswappable finite field matrix algebra library code just to perform raid-6? I think notÂ…

General tools are nice for general things. The kernel is not, almost by definition, a general thing.

Writing kernel modules in Haskell

Posted Sep 15, 2009 10:24 UTC (Tue) by sagrailo (guest, #48054) [Link]

Have you ever actually looked into the layout of a BLAS implementation? Netlib BLAS consists of hundreds of small files, mostly one file per numerical method, and it's easy to pickup needed files and put them in your code. On the other side, Netlib BLAS is admittedly just a reference implementation, not fast at all (and also actually requiring Fortran 77 compiler to build), and things get more complicated with optimized BLAS implementation. For example, with ATLAS, these files are generated, so either you'd have to generate them up-front for all needed processors, or to include the whole infrastructure for auto-generating these into your code (lots of code there, indeed, but if you want to do it right, you'd have actually to write all of this it yourself). As for vendor libraries (like MKL etc.), you may be in better luck with negotiating with vendors to provide you with some kind of license to use needed routines. But in any case you're better with using an existing solution, than to reimplementing it yourself (especially as numerical codes are notoriously hard to get right, which is why it makes even more sense to stick to re-using existing code).

Writing kernel modules in Haskell

Posted Sep 14, 2009 22:50 UTC (Mon) by Ed_L. (guest, #24287) [Link]

I'm sure you have good foundation for your opinions. But my experience, specifically with the Template Numeric Toolkit, is decidedly different. In one application where I implemented Lapack and TNT equivalents side by side, the TNT code was roughly twice as fast as Lapack. This was on CentOS, GCC 4.1.2. I don't recall whether I was linking the system-supplied Lapack or one I custom compiled, and I don't know if the difference was attributable to optimization flag differences or not. FWIW, library code is typically optimized -O2, my TNT segment was -O3 and there were several adjacent function call candidates for successive inline optimization. Perhaps a different example would skew a different direction.

Call me a crappy coder, but I have been singularly unsuccessful using multidimensional C++ arrays without resort to template wrappers, for which there is discernible run-time penalty for only the smallest of arrays. (Numerical Recipes style solutions also work.)

Writing kernel modules in Haskell

Posted Sep 15, 2009 10:35 UTC (Tue) by sagrailo (guest, #48054) [Link]

Yeah, right... I just quickly wrote this code (to multiply two matrices):

#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <ctime>
#include <iostream>

extern "C" {
#include <cblas.h>
}

#include <tnt.h>

#define N 1024

int
main(void)
{
    float* A = new float[N * N];
    float* B = new float[N * N];
    float* C = new float[N * N];
    std::fill(C, C + N * N, 0);

    TNT::Array2D<float< ATNT(N, N);
    TNT::Array2D<float< BTNT(N, N);
    TNT::Array2D<float< CTNT(N, N);

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i * N + j] = ATNT[i][j] = 1;
            B[i * N + j] = BTNT[i][j] = 1;
        }

    clock_t start, end;

    start = clock();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1, A, N, B,
                N, 0, C, N);
    end = clock();
    std::cerr
        << "ATLAS: "
        << (float) (end - start) / CLOCKS_PER_SEC
        << "s"
        << std::endl;

    start = clock();
    CTNT = TNT::matmult(ATNT, BTNT);
    end = clock();
    std::cerr
        << "TNT:   "
        << (float) (end - start) / CLOCKS_PER_SEC
        << "s"
        << std::endl;

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            assert(C[i * N + j] == N);
	    assert(CTNT[i][j] == N);
        }

    delete [] A;
    delete [] B;
    delete [] C;
}

Complied with (I've put TNT into /tmp, and I have ATLAS installed in /usr/local):

g++ -O3 -o foo -I/tmp/tnt foo.cpp -lcblas -latlas

The output, on my old ThinkPad T61p:

ATLAS: 0.22s
TNT:   17.64s
So - C++ almost two orders of magnitude slower (and that's typical, in my experience).

Writing kernel modules in Haskell

Posted Sep 15, 2009 12:01 UTC (Tue) by jbh (subscriber, #494) [Link]

Interesting! I've wanted to look into Eigen as well, which is another
template linear algebra library (http://eigen.tuxfamily.org). It seems to
hold up pretty well against Atlas (on Thinkpad T60, compiled with -msse2
flag):

ATLAS: 0.43s
TNT: 32.25s
Eigen: 0.76s

The major additions to your test program are (in different places):

#include <Eigen/Core>
--
MatrixXf Aeig(N,N), Beig(N,N), Ceig(N,N);
Aeig.setOnes(); // or Aeig = MatrixXf::Ones(N,N);
Beig.setOnes();
Ceig.setZero();
--
Ceig = Aeig * Beig;
--
assert(Ceig(i,j) == N);

Writing kernel modules in Haskell

Posted Sep 15, 2009 12:50 UTC (Tue) by bjacob (subscriber, #58566) [Link]

Hm, small world, I'm actually one of the main developers of Eigen. I didn't want to make a plug, but now that you've mentioned it...

I'm surprised by your result, because in our experience, we are consistently faster than ATLAS for matrix products. Already with Eigen 2.0, and even more so in the development branch. In fact, on single-CPU systems, we have almost the performance of vendor-tuned BLAS.

development branch:
http://eigen.tuxfamily.org/index.php?title=Benchmark

2.0:
http://eigen.tuxfamily.org/index.php?title=Benchmark-Augu...

It's good that you passed -msse2, indeed.

Some things to consider:

* did you allow ATLAS to use more than one thread? That would probably not be applicable in the setting of a kernel module! (As far as I understand) so perhaps it's more fair to require ATLAS to use only one thread.

* did you tune ATLAS for your hardware, especially CPU cache size? You can do the same for Eigen, then. For example with the development version it's
EIGEN_TUNE_FOR_CPU_CACHE_SIZE, see in the file Macros.h.

Writing kernel modules in Haskell

Posted Sep 15, 2009 13:09 UTC (Tue) by jbh (subscriber, #494) [Link]

In short, no tuning at all, neither ATLAS or Eigen. Standard ubuntu (jaunty)
packages for everything. This was on my laptop, and I only bother to tune
for production runs; those are not on my laptop :)

Eigen already looks very impressive, for a relatively young project in a
problem space that's dominated by dinosaurs --- I've meant to look into it
for a while, but not enough time and all that, and last time I checked it
didn't have sparse matrix support (I see it does now, although
experimental).

Writing kernel modules in Haskell

Posted Sep 15, 2009 13:29 UTC (Tue) by bjacob (subscriber, #58566) [Link]

OK, my best guess then is that ATLAS by default uses more than one thread...
anyway thanks for the kind words. Yes, there are a lot of dinosaurs in this area and we're sort of the small mammals in comparison.

(If it wasn't obvious, the largest advantage of a c++ template library like Eigen over a large C or Fortran binary library like a BLAS, is that it spontaneously generates only the code that's needed at build-time, and after that you can forget about it, so your kernel module is only a few dozen kilobytes self-contained; by comparison vendor-tuned BLAS take dozens of megabytes).

Writing kernel modules in Haskell

Posted Sep 29, 2009 23:14 UTC (Tue) by Auders (subscriber, #53318) [Link]

I did the same test with standard jaunty packages and the same compilation parameters and for me Eigen outperformed ATLAS, as you expected:
ATLAS: 0.69s
TNT: 21.93s
Eigen: 0.56s

Matrix Libraries

Posted Sep 16, 2009 20:50 UTC (Wed) by cry_regarder (subscriber, #50545) [Link]

Of course ublas has bindings for atlas so you can use this approach:

...
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/bindings/atlas/cblas.hpp>
#include <boost/numeric/bindings/traits/ublas_matrix.hpp>
...
namespace ublas = boost::numeric::ublas;
namespace atlas = boost::numeric::bindings::atlas;
...
atlas::gemm(CblasNoTrans, CblasNoTrans, 1.0,
Aublas, Bublas, 0.0, Cublas);
...

and this compile:
g++ -O3 -o mat mat.c++ -I/tmp/boost-numeric-bindings -L /usr/lib64/atlas -latlas -lcblas -DNDEBUG

to get run times like:

ATLAS: 0.16s
uBLAS: 0.16s

...

Cry

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds