A Gentoo x32 release candidate
Posted Jun 6, 2012 19:29 UTC (Wed)
by gmaxwell (guest, #30048)
In reply to: A Gentoo x32 release candidate by jzbiciak
Parent article: A Gentoo x32 release candidate
This point is of concern to me because, if the advantages are misunderstood, distributions which offer less configuration may adopt it in their default x86_64 configurations and send us back to the bad old days of 4 GB limits.
I'm happy to see Gentoo offering it, as it may be pretty interesting for embedded devices. I apologize for the bit of hijacking here.
Posted Jun 7, 2012 10:19 UTC (Thu)
by teknohog (guest, #70891)
[Link] (6 responses)
I agree that x32 has its technical benefits, but it is a nightmare for the consumer/marketing side.
For years, people have been educated on the advantages of 64-bit systems, and now that we are mostly done, we want to confuse things again with yet another ABI, which is apparently worse at only half the bits. Also, it will take time for closed software vendors to release x32 versions. When they catch up, people will have fun choosing from 3 different binaries.
On the technical side, x32 feels like overoptimization. We already trade some performance for overall convenience, for example by using higher-level languages and libraries. x86-64 already handles everything I currently do with computers; it would be weird to go back to something that does "almost everything", plus extra libraries for the rest.
As for the 4 GB limit per process, there is probably some quote involving 640 KB.
Posted Jun 7, 2012 20:08 UTC (Thu)
by slashdot (guest, #22014)
[Link] (5 responses)
32-bit-only x86 CPUs have already been very rare for a while, so x86 on Linux will hopefully die soon after x32 is released, resulting in an x32+x64 world instead of an x86+x64 one, now with a single kernel architecture.
x86 userland will probably survive forever on Windows though, unless Microsoft decides to introduce x32 as well.
Posted Jun 11, 2012 13:10 UTC (Mon)
by nix (subscriber, #2304)
[Link] (4 responses)
> 32-bit-only x86 CPUs have already been very rare for a while
Uh, Atom?
Posted Jun 11, 2012 14:42 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Jun 11, 2012 22:01 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jun 11, 2012 22:05 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Besides, amd64 is not much better than x86.
Posted Jun 12, 2012 22:45 UTC (Tue)
by BenHutchings (subscriber, #37955)
[Link]
Posted Jun 8, 2012 10:16 UTC (Fri)
by roblucid (guest, #48964)
[Link] (7 responses)
Actually, IMO the increased registers, better DMA, a higher CPU feature-set baseline (e.g. SSE2) and higher-resolution timers are the major advantages of AMD64. But back in the '90s, I told a Sun sales droid that 64 bit suffered pointer bloat (a typical machine back then had 16-64 MiB of RAM).
Even today none of the daily desktop applications I run uses more than 3 GiB of RAM, so having an x86_64 kernel with an x32 userland is a practicable optimisation. Compilers or RDBMS servers which actually use more than 3 GiB tend to run on atypical boxen; if you can run those without worrying about the extra x86_64 page faults, then the "bloat" due to duplicated system libraries is not a significant issue.
What's ironic about x32 is that it's come so late; a 64-bit kernel with a 32-bit userland would have been a good transitional step. Now RAM & extra cores are so cheap that even low-spec machines have a minimum of 3 GiB and dual cores, so most people won't noticeably benefit.
Posted Jun 8, 2012 17:03 UTC (Fri)
by teknohog (guest, #70891)
[Link] (6 responses)
True. It is easy to refer to other architectures like MIPS and Power that had an x32-like setup many years ago, but it was also a time of less capable hardware, so it was a more sensible optimization.
Posted Jun 8, 2012 17:17 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
That's very true. That said, L1D caches haven't gotten larger in that intervening time frame, and I'd suggest their impact on performance is still rather noticeable. L1Ds still seem to hover between 16K (on Zambezi, for example) and 64K (previous K10s).
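To put illustrative numbers on that (my own arithmetic, not from the thread): a list node holding nothing but two pointers is 16 bytes with 64-bit pointers and 8 bytes with 32-bit ones, so a 32 KB L1D holds roughly 2048 such nodes in the first case and 4096 in the second before it has to start evicting them.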
Posted Jun 8, 2012 17:46 UTC (Fri)
by jzbiciak (guest, #5246)
[Link] (4 responses)
For fun, I put together this admittedly very contrived benchmark just to compare the difference between 4- and 8-byte pointer sizes in perhaps the worst possible L1D thrashing scenario. Please don't laugh too much at my code; I wrote it in a hurry in the last 10 minutes.
What this code does, in short, is construct a scrambled linked list of structs, each containing simply a "next" pointer and a pointer to char. I step through the scrambled list, incrementing the pointer to char on each element. (I marked the pointer itself as volatile so it wouldn't be dead-coded.) I compiled the code as native 64-bit and as x86 (not x32), and let each run 10 trials of the benchmark loop. (gcc -O3 -fomit-frame-pointer in both cases; the only difference is that I used -m32 for the 32-bit version.)
Here are the results. I'll let you guess which column is 32-bit and which one is 64-bit.
1688.600ms 2606.294ms
1671.547ms 2561.276ms
1670.577ms 2626.574ms
1668.617ms 2599.231ms
1621.522ms 2193.314ms
1573.468ms 2108.417ms
1669.220ms 2266.626ms
1668.592ms 2507.869ms
1624.254ms 2195.270ms
1675.467ms 2611.658ms
Now, I didn't try x32 (I don't have that set up anywhere yet), but I wouldn't expect this simple benchmark to show any benefit for x32 over x86 given its rather narrow scope. The main point was to highlight that L1D cache pollution due to bloated pointers can also be a noticeable factor in some programs. Admittedly, my focused benchmark probably overstates the effect relative to the vast majority of programs, but I thought the data might be interesting nonetheless.
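The benchmark source itself isn't reproduced in the thread; a rough sketch of a pointer-chasing benchmark along the lines described above (node count, the shuffle and the output format are guesses, not jzbiciak's actual code) might look like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NODES  (1 << 20)   /* guess: big enough to spill well out of L1D */
    #define TRIALS 10

    struct node {
        struct node *next;      /* 4 bytes with -m32/-mx32, 8 bytes with -m64 */
        char * volatile data;   /* pointer itself is volatile so the ++ isn't dead-coded */
    };

    static struct node pool[NODES];
    static size_t order[NODES];
    static char scratch[NODES + TRIALS];

    int main(void)
    {
        size_t i;
        int trial;

        /* Fisher-Yates shuffle of the node indices to scramble the traversal order. */
        for (i = 0; i < NODES; i++)
            order[i] = i;
        srand(42);
        for (i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        /* Link the nodes in shuffled order into one big cycle. */
        for (i = 0; i < NODES; i++) {
            pool[order[i]].next = &pool[order[(i + 1) % NODES]];
            pool[order[i]].data = &scratch[i];
        }

        for (trial = 0; trial < TRIALS; trial++) {
            struct node *p = &pool[order[0]];
            clock_t start = clock();
            for (i = 0; i < NODES; i++) {
                p->data++;       /* touch each node's char pointer... */
                p = p->next;     /* ...then chase the scrambled next pointer */
            }
            printf("%.3fms\n", (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC);
        }
        return 0;
    }

Building the same source with -m64 and with -m32 (or -mx32) changes only the pointer size, and with it the node size, which is the whole point of the comparison.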
Posted Jun 8, 2012 20:27 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
Posted Jun 8, 2012 21:31 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
Interestingly, if I make the array larger to stress the L2 cache, the difference gets smaller (14s vs 16s for 100 iterations with (1<<21) nodes). I imagine that's due to some of the following facts:
Anyway, we see folks tilt at much shorter windmills than 67% all the time. :-) A 5% to 10% speedup might be interesting to some, especially if it translates to something like increased battery life. A 67% speedup in a key bit of code would be huge for some, but that may indeed be near the peak difference you might expect. In the end, I guess it'll be determined by benchmarking, one hopes.
Some more analysis on the bullets above: let's just consider the L1D cache, and assume everything hits L2. If we assume that, then the steady-state cost of each p->data++; p = p->next amounts to an L1D linefill plus a victim writeback from L1D to L2 for the replaced line. In that case, the cost for the 32-bit and 64-bit versions should be identical, since every dereference incurs a miss and a victim writeback of the same amount of data.
To see a difference between the 32-bit and 64-bit versions, therefore, you need to take the hit rate into account. Let's suppose the 32-bit version fits perfectly in L1D, but the 64-bit version (because it's twice the size) only fits halfway. Now none of the 32-bit requests miss, but half of the 64-bit requests do. The 32-bit version incurs no L1D miss penalty and no victim writeback penalty, while the 64-bit version, on average, incurs both on half of its dereferences. If we continue to reduce the size, eventually both versions fit entirely again, and are once again on an even footing.
This suggests that at the endpoints of the curve (all accesses hit and all accesses miss), the two perform more or less equivalently, at least for this benchmark. Through the transition band, though, the 64-bit version starts degrading sooner, and the 32-bit version asymptotically approaches its performance in the long tail.
The hit-rate expressions (expressing the hit rate for dereferencing *p->next) for both, assuming no pathological cache behavior, a good random ordering of list nodes and a dataset larger than L1D, should be something along the lines of: hit_rate = size_of_L1D / total_dataset. Now, this implies the hit rate will always be double for the smaller pointer size, because total_dataset would be half the size. But the performance will not double if the miss rates are high, because misses are expensive.
If we say that the cost of a hit is k1 and the cost of a miss is k2, then the total cost will be (k1 * hit_rate + k2 * (1 - hit_rate)). Suppose for the sake of argument that k2 = 10 * k1 and our hit rate is only 10% for 32-bit pointers and 5% for 64-bit pointers. (This ratio of k1 to k2 is fairly reasonable to a first order for modern architectures.) For 32-bit pointers, the cost would be (1 * 10% + 10 * 90%) = 9.1. For 64-bit pointers, the cost would be (1 * 5% + 10 * 95%) = 9.55. You can see how they'd asymptotically approach each other, since the cost of the misses dominates any gains made by the hits, and doubling the hits does not halve the number of misses.
The picture is quite a bit better for 32-bit if the hit rates are higher, though. Suppose the hit rate was 90% for 32-bit pointers and only 45% for 64-bit pointers. Now you have (1 * 90% + 10 * 10%) = 1.9 vs (1 * 45% + 10 * 55%) = 5.95.
Maybe if I get bored later, I could modify my program to collect a sweep of such datapoints. It might be enlightening. It certainly suggests that 64-bit pointers aren't automatic death for performance.
It also suggests that the gains 32-bit pointers might show are rather sensitive to how well your application fits in the cache to begin with, and how far the increased pointer size pushes you from "fitting" toward "not fitting". If you can tune your application to work on subproblems, it may be that you can tune both 32-bit and 64-bit variants to achieve nearly identical performance if you can make both utilize L1 effectively.
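A sweep of that cost model doesn't even need the benchmark; the little throwaway program below (not from the thread; it just reuses the k2 = 10 * k1 ratio and the halved-hit-rate assumption from the comment above) tabulates the two costs side by side:

    /* Tabulate cost = k1 * hit_rate + k2 * (1 - hit_rate) for a range of
     * 32-bit hit rates, assuming the 64-bit hit rate is half as large
     * (twice the dataset) and that a miss costs 10x a hit. */
    #include <stdio.h>

    int main(void)
    {
        const double k1 = 1.0, k2 = 10.0;
        int pct;

        for (pct = 10; pct <= 90; pct += 10) {
            double h32 = pct / 100.0;          /* 32-bit hit rate */
            double h64 = h32 / 2.0;            /* 64-bit dataset is twice as big */
            double cost32 = k1 * h32 + k2 * (1.0 - h32);
            double cost64 = k1 * h64 + k2 * (1.0 - h64);
            printf("hit32=%2d%%  cost32=%5.2f  cost64=%5.2f  64-bit slower by %3.0f%%\n",
                   pct, cost32, cost64, (cost64 / cost32 - 1.0) * 100.0);
        }
        return 0;
    }

The endpoints reproduce the worked numbers above (9.1 vs 9.55 at a 10% hit rate, 1.9 vs 5.95 at 90%); it's in the middle of the transition band that the gap opens up.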
Posted Jun 8, 2012 21:35 UTC (Fri)
by ABCD (subscriber, #53650)
[Link]
Using the same benchmark on this system, which *does* have x32, I get the following results:
-m32 -m64 -mx32
2283.403ms 3339.631ms 2282.777ms
2278.988ms 3250.245ms 2283.710ms
2284.797ms 3437.402ms 2285.109ms
2295.849ms 3344.579ms 2282.430ms
2247.007ms 2988.275ms 2227.092ms
2189.324ms 2872.535ms 2178.817ms
2309.024ms 3118.024ms 2278.871ms
2341.720ms 3140.920ms 2287.304ms
2229.621ms 2999.011ms 2207.783ms
2295.220ms 3435.611ms 2291.899ms
Posted Jun 9, 2012 1:30 UTC (Sat)
by vapier (guest, #15768)
[Link]
