A Gentoo x32 release candidate
Posted Jun 6, 2012 19:29 UTC (Wed)
by gmaxwell (guest, #30048)
In reply to: A Gentoo x32 release candidate by jzbiciak
Parent article: A Gentoo x32 release candidate
This point is of concern to me because, if the advantages are misunderstood, distributions which offer less configuration may adopt it in their default x86_64 configurations and send us back to the bad old days of 4 GB limits.
I'm happy to see Gentoo offering it, as it may be pretty interesting for embedded devices. I apologize for the bit of hijacking here.
Posted Jun 7, 2012 10:19 UTC (Thu)
by teknohog (guest, #70891)
[Link] (6 responses)
I agree that x32 has its technical benefits, but it is a nightmare for the consumer/marketing side.
For years, people have been educated on the advantages of 64-bit systems, and now that we are mostly done, we want to confuse things again with yet another ABI, which is apparently worse at only half the bits. Also, it will take time for closed software vendors to release x32 versions. When they catch up, people will have fun choosing from 3 different binaries.
On the technical side, x32 feels like overoptimization. We already trade some performance for overall convenience, for example by using higher-level languages and libraries. x86-64 already handles everything I currently do with computers; it would be weird to go back to something that does "almost everything", plus extra libraries for the rest.
As for the 4 GB limit per process, there is probably some quote involving 640 KB.
Posted Jun 7, 2012 20:08 UTC (Thu)
by slashdot (guest, #22014)
[Link] (5 responses)
32-bit-only x86 CPUs have already been very rare for a while, so x86 on Linux will hopefully die soon after x32 is released, resulting in an x32+x64 world instead of an x86+x64 one, now with a single kernel architecture.
x86 userland will probably survive forever on Windows though, unless Microsoft decides to introduce x32 as well.
Posted Jun 11, 2012 13:10 UTC (Mon)
by nix (subscriber, #2304)
[Link] (4 responses)
> 32-bit-only x86 CPUs have already been very rare for a while
Uh, Atom?
Posted Jun 11, 2012 14:42 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Jun 11, 2012 22:01 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jun 11, 2012 22:05 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Besides, amd64 is not much better than x86.
Posted Jun 12, 2012 22:45 UTC (Tue)
by BenHutchings (subscriber, #37955)
[Link]
Posted Jun 8, 2012 10:16 UTC (Fri)
by roblucid (guest, #48964)
[Link] (7 responses)
Actually, IMO the increased registers, better DMA, a higher CPU feature-set baseline (e.g. SSE2) and higher-resolution timers are the major advantages of AMD64. But back in the '90s, I told a Sun sales droid that 64 bit suffered pointer bloat (a typical machine back then had 16-64 MiB of RAM).
Even today none of the daily desktop applications I run uses more than 3 GiB of RAM, so having an x86_64 kernel with an x32 userland is a practicable optimisation. Compilers or RDBMS servers which actually use more than 3 GiB tend to run on atypical boxen; if you can run those without worrying about the extra x86_64 page faults, then the "bloat" due to duplicated system libraries is not a significant issue.
What's ironic about x32 is that it's come so late; a 64-bit kernel with a 32-bit userland would have been a good transitional step. Now RAM & extra cores are so cheap that even low-spec machines have a minimum of 3 GiB and dual cores, so most people won't noticeably benefit.
Posted Jun 8, 2012 17:03 UTC (Fri)
by teknohog (guest, #70891)
[Link] (6 responses)
True. It is easy to refer to other architectures like MIPS and Power that had an x32-like setup many years ago, but it was also a time of less capable hardware, so it was a more sensible optimization.
Posted Jun 8, 2012 17:17 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
That's very true. That said, L1D caches haven't gotten larger in that intervening time frame, and I'd suggest their impact on performance is still rather noticeable. L1Ds still seem to hover between 16K (on Zambezi, for example) and 64K (previous K10s).
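To put illustrative numbers on that (my own arithmetic, not from the thread): a list node holding nothing but two pointers is 16 bytes with 64-bit pointers and 8 bytes with 32-bit ones, so a 32 KB L1D holds roughly 2048 such nodes in the first case and 4096 in the second before it has to start evicting them.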
Posted Jun 8, 2012 17:46 UTC (Fri)
by jzbiciak (guest, #5246)
[Link] (4 responses)
For fun, I put together this admittedly very contrived benchmark just to compare the difference between 4- and 8-byte pointer sizes in perhaps the worst possible L1D thrashing scenario. Please don't laugh too much at my code; I wrote it in a hurry in the last 10 minutes.
What this code does, in short, is construct a scrambled linked list of structs, each containing simply a "next" pointer and a pointer to char. I step through the scrambled list, incrementing the pointer to char on each element. (I marked the pointer itself as volatile so it wouldn't be dead-coded.) I compiled the code as native 64-bit and as x86 (not x32), and let each run 10 trials of the benchmark loop. (gcc -O3 -fomit-frame-pointer in both cases; the only difference is that I used -m32 for the 32-bit version.)
Here are the results. I'll let you guess which column is 32-bit and which one is 64-bit.
1688.600ms 2606.294ms
1671.547ms 2561.276ms
1670.577ms 2626.574ms
1668.617ms 2599.231ms
1621.522ms 2193.314ms
1573.468ms 2108.417ms
1669.220ms 2266.626ms
1668.592ms 2507.869ms
1624.254ms 2195.270ms
1675.467ms 2611.658ms
Now, I didn't try x32 (I don't have that set up anywhere yet), but I wouldn't expect this simple benchmark to show any benefit for x32 over x86 given its rather narrow scope. The main point was to highlight that L1D cache pollution due to bloated pointers can also be a noticeable factor in some programs. Admittedly, my focused benchmark probably overstates the effect relative to the vast majority of programs, but I thought the data might be interesting nonetheless.
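The benchmark source itself isn't reproduced in the thread; a rough sketch of a pointer-chasing benchmark along the lines described above (node count, the shuffle and the output format are guesses, not jzbiciak's actual code) might look like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NODES  (1 << 20)   /* guess: big enough to spill well out of L1D */
    #define TRIALS 10

    struct node {
        struct node *next;      /* 4 bytes with -m32/-mx32, 8 bytes with -m64 */
        char * volatile data;   /* pointer itself is volatile so the ++ isn't dead-coded */
    };

    static struct node pool[NODES];
    static size_t order[NODES];
    static char scratch[NODES + TRIALS];

    int main(void)
    {
        size_t i;
        int trial;

        /* Fisher-Yates shuffle of the node indices to scramble the traversal order. */
        for (i = 0; i < NODES; i++)
            order[i] = i;
        srand(42);
        for (i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        /* Link the nodes in shuffled order into one big cycle. */
        for (i = 0; i < NODES; i++) {
            pool[order[i]].next = &pool[order[(i + 1) % NODES]];
            pool[order[i]].data = &scratch[i];
        }

        for (trial = 0; trial < TRIALS; trial++) {
            struct node *p = &pool[order[0]];
            clock_t start = clock();
            for (i = 0; i < NODES; i++) {
                p->data++;       /* touch each node's char pointer... */
                p = p->next;     /* ...then chase the scrambled next pointer */
            }
            printf("%.3fms\n", (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC);
        }
        return 0;
    }

Building the same source with -m64 and with -m32 (or -mx32) changes only the pointer size, and with it the node size, which is the whole point of the comparison.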
Posted Jun 8, 2012 20:27 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
Posted Jun 8, 2012 21:31 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
Interestingly, if I make the array larger to stress the L2 cache, the difference gets smaller (14s vs 16s for 100 iterations with (1<<21) nodes). I imagine that's due to some of the following facts:
Anyway, we see folks tilt at much shorter windmills than 67% all the time. :-) A 5% to 10% speedup might be interesting to some, especially if it translates to something like increased battery life. A 67% speedup in a key bit of code would be huge for some, but that may indeed be near the peak difference you might expect. In the end, I guess it'll be determined by benchmarking, one hopes.
Some more analysis on the bullets above: let's just consider the L1D cache, and assume everything hits L2. If we assume that, then the steady-state cost of each p->data++; p = p->next amounts to an L1D linefill plus a victim writeback from L1D to L2 for the replaced line. In that case, the cost for the 32-bit and 64-bit versions should be identical, since every dereference incurs a miss and a victim writeback of the same amount of data.
To see a difference between the 32-bit and 64-bit versions, therefore, you need to take the hit rate into account. Let's suppose the 32-bit version fits perfectly in L1D, but the 64-bit version (because it's twice the size) only fits halfway. Now none of the 32-bit requests miss, but half of the 64-bit requests do. The 32-bit version incurs no L1D miss penalty and no victim writeback penalty, while the 64-bit version, on average, incurs both on half of its dereferences. If we continue to reduce the size, eventually both versions fit entirely again, and are once again on an even footing.
This suggests that at the endpoints of the curve (all accesses hit and all accesses miss), the two perform more or less equivalently, at least for this benchmark. Through the transition band, though, the 64-bit version starts degrading sooner, and the 32-bit version asymptotically approaches its performance in the long tail.
The hit-rate expressions (expressing the hit rate for dereferencing *p->next) for both, assuming no pathological cache behavior, a good random ordering of list nodes and a dataset larger than L1D, should be something along the lines of: hit_rate = size_of_L1D / total_dataset. Now, this implies the hit rate will always be double for the smaller pointer size, because total_dataset would be half the size. But the performance will not double if the miss rates are high, because misses are expensive.
If we say that the cost of a hit is k1 and the cost of a miss is k2, then the total cost will be (k1 * hit_rate + k2 * (1 - hit_rate)). Suppose for the sake of argument that k2 = 10 * k1 and our hit rate is only 10% for 32-bit pointers and 5% for 64-bit pointers. (This ratio of k1 to k2 is fairly reasonable to a first order for modern architectures.) For 32-bit pointers, the cost would be (1 * 10% + 10 * 90%) = 9.1. For 64-bit pointers, the cost would be (1 * 5% + 10 * 95%) = 9.55. You can see how they'd asymptotically approach each other, since the cost of the misses dominates any gains made by the hits, and doubling the hits does not halve the number of misses.
The picture is quite a bit better for 32-bit if the hit rates are higher, though. Suppose the hit rate was 90% for 32-bit pointers and only 45% for 64-bit pointers. Now you have (1 * 90% + 10 * 10%) = 1.9 vs (1 * 45% + 10 * 55%) = 5.95.
Maybe if I get bored later, I could modify my program to collect a sweep of such datapoints. It might be enlightening. It certainly suggests that 64-bit pointers aren't automatic death for performance.
It also suggests that the gains 32-bit pointers might show are rather sensitive to how well your application fits in the cache to begin with, and how far the increased pointer size pushes you from "fitting" toward "not fitting". If you can tune your application to work on subproblems, it may be that you can tune both 32-bit and 64-bit variants to achieve nearly identical performance if you can make both utilize L1 effectively.
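A sweep of that cost model doesn't even need the benchmark; the little throwaway program below (not from the thread; it just reuses the k2 = 10 * k1 ratio and the halved-hit-rate assumption from the comment above) tabulates the two costs side by side:

    /* Tabulate cost = k1 * hit_rate + k2 * (1 - hit_rate) for a range of
     * 32-bit hit rates, assuming the 64-bit hit rate is half as large
     * (twice the dataset) and that a miss costs 10x a hit. */
    #include <stdio.h>

    int main(void)
    {
        const double k1 = 1.0, k2 = 10.0;
        int pct;

        for (pct = 10; pct <= 90; pct += 10) {
            double h32 = pct / 100.0;          /* 32-bit hit rate */
            double h64 = h32 / 2.0;            /* 64-bit dataset is twice as big */
            double cost32 = k1 * h32 + k2 * (1.0 - h32);
            double cost64 = k1 * h64 + k2 * (1.0 - h64);
            printf("hit32=%2d%%  cost32=%5.2f  cost64=%5.2f  64-bit slower by %3.0f%%\n",
                   pct, cost32, cost64, (cost64 / cost32 - 1.0) * 100.0);
        }
        return 0;
    }

The endpoints reproduce the worked numbers above (9.1 vs 9.55 at a 10% hit rate, 1.9 vs 5.95 at 90%); it's in the middle of the transition band that the gap opens up.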
Posted Jun 8, 2012 21:35 UTC (Fri)
by ABCD (subscriber, #53650)
[Link]
Using the same benchmark on this system, which *does* have x32, I get the following results:
-m32 -m64 -mx32
2283.403ms 3339.631ms 2282.777ms
2278.988ms 3250.245ms 2283.710ms
2284.797ms 3437.402ms 2285.109ms
2295.849ms 3344.579ms 2282.430ms
2247.007ms 2988.275ms 2227.092ms
2189.324ms 2872.535ms 2178.817ms
2309.024ms 3118.024ms 2278.871ms
2341.720ms 3140.920ms 2287.304ms
2229.621ms 2999.011ms 2207.783ms
2295.220ms 3435.611ms 2291.899ms
Posted Jun 9, 2012 1:30 UTC (Sat)
by vapier (guest, #15768)
[Link]
