<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
>

  <channel rdf:about="http://lwn.net/headlines/255364/">
    <title>LWN: Comments on "Memory part 5: What programmers can do"</title>
    <link>http://lwn.net/Articles/255364/</link>
    <description>
This is a special feed containing comments posted
to the individual LWN article titled &quot;Memory part 5: What programmers can do&quot;.

    </description>

    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>2</syn:updateFrequency>
    <items>
      <rdf:Seq>
	<rdf:li resource="http://lwn.net/Articles/256939/rss" />
	<rdf:li resource="http://lwn.net/Articles/256903/rss" />
	<rdf:li resource="http://lwn.net/Articles/256563/rss" />
	<rdf:li resource="http://lwn.net/Articles/256250/rss" />
	<rdf:li resource="http://lwn.net/Articles/256248/rss" />
	<rdf:li resource="http://lwn.net/Articles/256247/rss" />
	<rdf:li resource="http://lwn.net/Articles/256246/rss" />
	<rdf:li resource="http://lwn.net/Articles/256217/rss" />
	<rdf:li resource="http://lwn.net/Articles/256075/rss" />
	<rdf:li resource="http://lwn.net/Articles/256033/rss" />
	<rdf:li resource="http://lwn.net/Articles/255843/rss" />
	<rdf:li resource="http://lwn.net/Articles/255737/rss" />
	<rdf:li resource="http://lwn.net/Articles/255686/rss" />
	<rdf:li resource="http://lwn.net/Articles/255646/rss" />
	<rdf:li resource="http://lwn.net/Articles/255609/rss" />
	<rdf:li resource="http://lwn.net/Articles/255608/rss" />
	<rdf:li resource="http://lwn.net/Articles/255603/rss" />
	<rdf:li resource="http://lwn.net/Articles/255594/rss" />
      
      </rdf:Seq>
    </items>

  </channel>
    <item rdf:about="http://lwn.net/Articles/256939/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/256939/rss</link>
      <dc:date>2007-11-02T17:20:19+00:00</dc:date>
      <dc:creator>iulianm</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
...ignore the cache dirtying part... I meant loading more cache lines to access mul2 in the
innermost loop, when only one is needed
&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256903/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/256903/rss</link>
      <dc:date>2007-11-02T15:34:06+00:00</dc:date>
      <dc:creator>iulianm</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
Referring to the optimized matrix multiplication code, the text reads:

&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; k2 and j2 loops are in a different order. This is done since, in the actual &lt;/font&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; computation, only one expression depends on k2 but two depend on j2&lt;/font&gt;

I believe that a better reason for changing the order of the two loops is that this way the
mul2 matrix is traversed by rows instead of by columns, which is the whole point of the
example since it prevents cache dirtying when accessing the elements of this matrix.
&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256563/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/256563/rss</link>
      <dc:date>2007-10-31T10:40:23+00:00</dc:date>
      <dc:creator>wingo</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
I thought the same.
&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256250/rss">
      <title>matrix multiply optimization</title>
      <link>http://lwn.net/Articles/256250/rss</link>
      <dc:date>2007-10-27T22:41:11+00:00</dc:date>
      <dc:creator>giraffedata</dc:creator>
      <description>
      Aha.  Perfectly clear now.  The article neglects to explain this; I'd probably say, &quot;the original traverses mul2 in this expensive nonsequential way 1000 times, whereas the improved version does it only once.&quot;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256248/rss">
      <title>matrix multiply optimization</title>
      <link>http://lwn.net/Articles/256248/rss</link>
      <dc:date>2007-10-27T21:55:06+00:00</dc:date>
      <dc:creator>bartoldeman</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
Because transposition uses O(N^2) accesses and multiplication O(N^3). The accesses in the
transposition are more expensive but there are N times fewer than in the multiplication...
&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256247/rss">
      <title>matrix multiply optimization</title>
      <link>http://lwn.net/Articles/256247/rss</link>
      <dc:date>2007-10-27T21:44:09+00:00</dc:date>
      <dc:creator>giraffedata</dc:creator>
      <description>
      Only about a factor of 5 is due to memory optimization.  The vector instruction optimization is simply applying more CPU horsepower.
&lt;p&gt;
What I can't figure out is how the transposition speeds anything up.  The article points out that it removes 1000 non-sequential accesses per column from the multiplication loop, but I see that same 1000 non-sequential accesses per column added to the transposition loop.

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256246/rss">
      <title>76.6% speedup</title>
      <link>http://lwn.net/Articles/256246/rss</link>
      <dc:date>2007-10-27T21:32:20+00:00</dc:date>
      <dc:creator>giraffedata</dc:creator>
      <description>
      I think it's ambiguous.  While it's clear that it's a 427% increase in speed and a 77% decrease in slowness, &quot;speed-up&quot; might refer to either of those, and also to the 77% reduction in execution time.
&lt;p&gt;
For practical use, I believe per centages of execution time are more useful than per centages of speed because time is the limited resource.

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256217/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/256217/rss</link>
      <dc:date>2007-10-27T14:15:50+00:00</dc:date>
      <dc:creator>bartoldeman</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
It is even 20 times faster if you use ATLAS 3.8:
&lt;a href=&quot;http://math-atlas.sourceforge.net/&quot;&gt;http://math-atlas.sourceforge.net/&lt;/a&gt;
Its DGEMM routine does the job in around 708,000,000 cycles, another factor of 2 faster (my
other numbers, also on a Core 2 (a Duo, but single threaded), were very similar to Ulrich's so
I can state this with some confidence). Of course there's been a lot of research and tweaking
to obtain this score.

ATLAS' SUMMARY.LOG reports for DGEMM:
 Performance: 4846.05MFLOPS (302.88 percent of of detected clock rate)
and this is on a 1.6GHz Core 2 Duo, not 2.66GHz!

GOTOBLAS may also be worth looking at, for comparison. It looks more into TLB misses than
ATLAS does.
&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256075/rss">
      <title>Cache-oblivious algorithms</title>
      <link>http://lwn.net/Articles/256075/rss</link>
      <dc:date>2007-10-26T13:30:56+00:00</dc:date>
      <dc:creator>akapoor</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
i guess it's worth mentioning harald-prokop's 1999 thesis on &quot;cache oblivious algorithms&quot;
(&lt;a href=&quot;http://citeseer.ist.psu.edu/prokop99cacheobliviou.html&quot;&gt;http://citeseer.ist.psu.edu/prokop99cacheobliviou.html&lt;/a&gt;).

&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/256033/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/256033/rss</link>
      <dc:date>2007-10-26T02:42:41+00:00</dc:date>
      <dc:creator>avassalotti</dc:creator>
      <description>
      &lt;blockquote&gt;&lt;pre&gt;
                Original        Transposed
    Cycles	16,765,297,870  3,922,373,010
    Relative	100%            23.4%

Through the simple transformation of the matrix we can achieve a 76.6% speed-up!
&lt;/pre&gt;&lt;/blockquote&gt;

I could wrong, but I think that is a speedup of 427.4%, not 76.6% 
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255843/rss">
      <title>Missing: conventional wisdom turned into lie</title>
      <link>http://lwn.net/Articles/255843/rss</link>
      <dc:date>2007-10-25T10:48:22+00:00</dc:date>
      <dc:creator>mmutz</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; If the array all fits in cache, binary search wins: both linear and&lt;/font&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; binary search, at most, load the data into cache once. If the array is&lt;/font&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; extremely large, binary search still wins despite the wasted reads of&lt;/font&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; entire cache lines, because at some point N/log(N) exceeds the miss&lt;/font&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; penalty.&lt;/font&gt;

This is exactly the conventional wisdom I think isn't true anymore for 
some realistic examples. You're missing the effect of the hardware 
prefetcher, which will, for some sizes, and esp. for cold caches lead to a 
performance increase for the linear search, as opposed to binary search, 
which looks like random access to the prefetcher, and therefore incurs the 
RAM-&amp;gt;cache latency cost typically hidden by the prefetcher.

&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255737/rss">
      <title>On very small arrays binary search loses - did so for years</title>
      <link>http://lwn.net/Articles/255737/rss</link>
      <dc:date>2007-10-24T20:29:42+00:00</dc:date>
      <dc:creator>khim</dc:creator>
      <description>
      &lt;p&gt;If you just have 4-5 elements in the array then initial complexity of binary search will dominate - with or without memory access effects. At some point binary search will win - but it'll be good exercise to find this point today...&lt;/p&gt;

&lt;p&gt;Actually it's true for most algorithms: bubble sort also wins for small arrays, for example. Usually trivial algorithms work well for small sizes but not-so-well for medium-to-large sizes. Usually it's not worth the effort to optimize for small sizes: it's fast not matter what algorithm you are using. If your program is spending 90% of time searchive 5-element arrays - it's different story, of course, but it's rare in practice...&lt;/p&gt;
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255686/rss">
      <title>Missing: conventional wisdom turned into lie</title>
      <link>http://lwn.net/Articles/255686/rss</link>
      <dc:date>2007-10-24T16:21:17+00:00</dc:date>
      <dc:creator>JoeBuck</dc:creator>
      <description>
      If the array all fits in cache, binary search wins: both linear and binary search, at most, load the data into cache once.  If the array is extremely large, binary search still wins despite the wasted reads of entire cache lines, because at some point N/log(N) exceeds the miss penalty.  There may be intermediate values where binary search loses, when the random access causes the cache miss rate to soar and this isn't compensated for adequately by the reduced expected number of element accesses.

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255646/rss">
      <title>Missing: conventional wisdom turned into lie</title>
      <link>http://lwn.net/Articles/255646/rss</link>
      <dc:date>2007-10-24T12:48:35+00:00</dc:date>
      <dc:creator>mmutz</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
One thing I miss (but that maybe comes later) is the effect of all this on 
common-sense optimizations such as binary search. E.g., it would 
be interesting for what kind of working sets binary search (having to look 
random to even current prefetchers) actually starts to perform better than 
linear search.

It would be a nice example of how a O(N) algorithm can dramatically
outperform a O(log N) algorithm, due to memory effects. I'm sure there's 
lots more of common wisdom that turns into a lie on modern processors.

&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255609/rss">
      <title>cachegrind</title>
      <link>http://lwn.net/Articles/255609/rss</link>
      <dc:date>2007-10-24T04:55:03+00:00</dc:date>
      <dc:creator>JoeBuck</dc:creator>
      <description>
      I would expect to see some mention of cachegrind and related tools, for helping to identify where the cache misses are and getting a good estimate of what they are costing you.  Maybe Uli will address that in a different section.


      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255608/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/255608/rss</link>
      <dc:date>2007-10-24T04:54:33+00:00</dc:date>
      <dc:creator>alangley</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
In Table 6.1, are not the numbers in the second column the wrong way round? The commentary
just below it suggests that the non-temporal case should be faster.
&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255603/rss">
      <title>Cache-oblivious algorithms</title>
      <link>http://lwn.net/Articles/255603/rss</link>
      <dc:date>2007-10-24T02:55:32+00:00</dc:date>
      <dc:creator>ncm</dc:creator>
      <description>
      It is worth mentioning here what are now called &quot;cache-oblivious&quot; algorithms -- ways to code that get most of the benefits of torturing your code to match all the cache and cache-line sizes without actually knowing what any of those sizes are.  The approach involves traversing data sets in an order that noodles around using a maximally-folded curve, to improve locality regardless of cache boundaries.
&lt;p&gt;The general approach was &lt;a href=&quot;http://ubiety.uwaterloo.ca/~tveldhui/papers/DrDobbs2/drdobbs2.html&quot; &gt;first applied&lt;/a&gt; to matrix operations by Todd Veldhuizen in 1996.  Veldhuizen used it in his Free matrix library &lt;a href=&quot;http://oonumerics.org/blitz/&quot; &gt;Blitz++&lt;/a&gt;, which he demonstrated running 60% faster than IBM's own Fortran running on an IBM vector machine.   Harold Prokop proposed the name that stuck in a paper in 1998.  (Oddly, Prokop et al. have refused to acknowledge Veldhuizen's prior work.)  Veldhuizen is now a professor at U. Waterloo.  The technique is used in the &quot;FFTW&quot; FFT library, and in every modern C++ matrix library, and graduate students everywhere are writing theses on adapting various algorithms to use it.
&lt;p&gt;(Cache-oblivious sorting, by the way, was developed in the '70s, but at the time the &quot;cache&quot; was main memory, of a size similar to the L2 caches today, which spilled to disk or tape.)
&lt;p&gt;As a consequence of its ability to encapsulate these techniques (and many others) in easy-to-use libraries, since about 2000 C++ has been the preferred language for serious numeric computation.  In another recent advance, the (GPL'd) &lt;a href=&quot;http://www.codesourcery.com/vsiplplusplus&quot; &gt;VSIPL++&lt;/a&gt; library encapsulates parallel processing (using e.g. MPI), making it easy to write portable, automatically parallelized matrix and image processing programs.

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/255594/rss">
      <title>Memory part 5: What programmers can do</title>
      <link>http://lwn.net/Articles/255594/rss</link>
      <dc:date>2007-10-23T22:32:15+00:00</dc:date>
      <dc:creator>Coren</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;&lt;pre&gt;
10 times faster on a simple plain matrix multiplication. Amazing.
With optimizing only memory access. 

I did not realise, before this series of articles, that memory can take a such place in
optimisation.

Thanks a lot, it's so interesting and well written.

&lt;/pre&gt;&lt;/div&gt;

      
      </description>
    </item>
</rdf:RDF>

