Considering that you doubled the sizes of your nodes, I'm a bit surprised that the 64-bit version only took 67% longer (going by maximum 64-bit time vs. minimum 32-bit time). If that's the worst case, perhaps the performance benefits of 32-bit pointers really are somewhat exaggerated, at least for programs which aren't dealing exclusively with pointers.
Posted Jun 8, 2012 21:31 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)
[Link]
Interestingly, if I make the array larger to stress the L2 cache, the difference get smaller (14s vs 16s for 100 iterations with (1<<21) nodes). I imagine that's due to some of the following facts:
The r-m-w on the datum is guaranteed to hit L1 after ->next gets brought in, regardless of pointer size, which means this operation is equal cost for 32-bit and 64-bit and can largely be ignored, save for the victim writebacks it generates.
The subsequent loop iteration (ie. accessing the next structure in the list) is much, much more likely to miss L1D regardless of pointer size with the larger data set, and somewhat more likely to miss L2. This tends to equalize 32-bit and 64-bit performance. (See analysis below).
The 64-bit version shows less relative bandwidth amplification due to cache writebacks than the 32-bit version. That is, with my CPU's 64-byte linesize, an r-m-w on an 8 byte structure could generate a 64-byte writeback (8x amplification), whereas the relative ratio for a 16 byte structure is half that. A different way of thinking about it is that the total number of bytes written due to cache writebacks for both versions should be fairly similar if they have similar hit rates. Their hit rates converge as the dataset grows beyond the cache size.
Anyway, we see folks tilt at much shorter windmills than 67% all the time. :-) A 5% to 10% speedup might be interesting to some, especially if it translates to something like increased battery life. A 67% speedup in a key bit of code would be huge for some, but that may indeed be near the peak difference you might expect. In the end, I guess it'll be determined by benchmarking, one hopes.
Some more analysis on the bullets above: Let's just consider the L1D cache, and assume everything hits L2. If we first assume that, then the steady state cost of each p->data++; p = p->next amounts to am L1D linefill plus a victim writeback from L1D to L2 for the replaced line. In this case, then the cost for 32-bit and 64-bit versions should be identical, since every dereference incurs a miss and a victim writeback of the same amount of data.
To see a difference between the 32-bit and 64-bit versions, therefore, you need to take the hit-rate into account. Let's suppose the 32-bit version fits perfectly in L1D, but the 64-bit version (because it's twice the size) only fits halfway. Now none of the 32-bit requests miss, but half of the 64-bit requests do. The 32-bit version incurs no L1D miss penalty and no victim writeback penalty, while the 64-bit version, on average, incurs both on half of its dereferences.
If we continue to reduce the size, eventually both versions fit entirely again, and are once again on an even footing. This suggests at the endpoints of the curve (all accesses hit and all accesses miss), the two perform more or less equivalently, at least for this benchmark. Through the transition band, though, the 64-bit version starts degrading sooner, and the 32-bit version asymptotically approaches its performance in the long tail.
The hit-rate expressions (expressing the hit rate for dereferencing *p->next) for both, assuming no pathological cache behavior and a good random ordering of list nodes and a dataset larger than L1D, should be something along the lines of: hit_rate = size_of_L1D / total_dataset. Now, this implies the hit rate will always be double for the smaller pointer size, because total_dataset would be half the size.
But, the performance will not double if the miss rates are high, because misses are expensive. If we say that the cost of a hit is k1 and the cost of a miss is k2, then the total cost will be (k1 * hit_rate + k2 * (1 - hit_rate)). Suppose for sake of argument that k2 = 10 * k1 and our hit rate is only 10% for 32-bit pointers and 5% for 64-bit pointers. (This ratio of k1 to k2 is fairly reasonable to a first order for modern architectures.) For 32-bit, the cost would be (1 * 10% + 10 * 90%) = 9.1. For 64-bit pointers, the cost would be (1 * 5% + 10 * 95%) = 9.55. You can see how they'd asymptotically approach, since the cost of the misses dominate any gains made by the hits, and doubling the hits does not halve the number of misses.
The picture is quite a bit better for 32-bit if the hit rates are higher though. Suppose the hit rate was 90% for 32-bit pointers and only 45% for 64-bit pointers. Now you have (1 * 90% + 10 * 10%) = 1.9vs (1 * 45% + 10 * 55%) = 5.95.
Maybe if I get bored later, I could modify my program to collect a sweep of such datapoints. It might be enlightening.
It certainly suggests that 64-bit pointers aren't automatic death for performance. It also suggests that the gains 32-bit pointers might show are rather sensitive to how well your application fits in the cache to begin with, and how far the increased pointer size pushes you from "fitting" toward "not fitting". If you can tune your application to work on subproblems, it may be that you can tune both 32-bit and 64-bit variants to achieve nearly identical performance if you can make both utilize L1 effectively.