
A Gentoo x32 release candidate

Posted Jun 6, 2012 15:40 UTC (Wed) by gmaxwell (guest, #30048)
Parent article: A Gentoo x32 release candidate

(A comment on the LWN blurb, not the post)

"getting the best of both"

… Because my life was not complete without being limited to 32 bits of VM space and the resulting loss of program scalability and ASLR effectiveness!



A Gentoo x32 release candidate

Posted Jun 6, 2012 15:51 UTC (Wed) by mikemol (guest, #83507) [Link] (22 responses)

x32 is intended for programs which don't require more than 32-bits of VM space. Which is very likely most of the programs on your system.

As for reduced ASLR effectiveness...Unless your program is very, very long-running, and an attack that ASLR may mitigate won't otherwise cause the program to fail, I don't see there being a meaningfully large distinction. Even with a 64-bit-address-space system, the actual usable area of the address space isn't 64 bits.

You're aware, of course, that x86-64 systems have a 64-bit-wide pointer, but often don't have that many address lines available at the hardware level for logical or physical addresses? Last time I looked at /proc/cpuinfo (and paid attention), I only had 48 bits for virtual, 48 bits for physical.
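To check the address widths on a given machine, the "address sizes" line of /proc/cpuinfo can be parsed. The following is a minimal sketch, assuming the usual x86 Linux line format ("address sizes : N bits physical, M bits virtual"); the function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Parse one "address sizes" line in the form used by /proc/cpuinfo on x86
 * Linux, e.g. "address sizes\t: 36 bits physical, 48 bits virtual".
 * Returns 1 and stores both widths on success, 0 otherwise.
 * (Hypothetical helper, not part of any library.) */
int parse_address_sizes(const char *line, int *phys_bits, int *virt_bits)
{
    const char *colon = strchr(line, ':');
    if (colon == NULL || strncmp(line, "address sizes", 13) != 0)
        return 0;
    return sscanf(colon + 1, " %d bits physical, %d bits virtual",
                  phys_bits, virt_bits) == 2;
}

/* Scan /proc/cpuinfo and report the first "address sizes" entry found.
 * Returns 0 if the file cannot be opened or has no such line. */
int read_address_sizes(int *phys_bits, int *virt_bits)
{
    char line[256];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, f) != NULL)
        found = parse_address_sizes(line, phys_bits, virt_bits);
    fclose(f);
    return found;
}
```

Calling read_address_sizes(&p, &v) and printing p and v shows how many of the 64 pointer bits the CPU actually implements.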

A Gentoo x32 release candidate

Posted Jun 6, 2012 17:02 UTC (Wed) by gmaxwell (guest, #30048) [Link] (21 responses)

My browser is _currently_ using more than 4GB of VM. G++ builds (esp with LTO) can easily use that much VM. I frequently run computation and analysis where I need more than 4GB in my processes. Moreover, the programs that use tons of pointers where x32 would be a big savings aren't using much memory to begin with.

Sure, you can multiarch it but then you're wasting memory with two copies of at least libc in memory and probably hundreds of other libraries.

My understanding was that x32 was primarily interesting on mobile and I think at the moment it makes sense there. Desktop/server (even embedded) is not so clear to me.

As far as ASLR goes, I suggest you go measure the entropy of the ASLR positioning on x86. It's very low, especially since the attacker doesn't need to hit a position exactly and so the low bits don't contribute.
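One rough way to do such a measurement is to record the same address (say, a stack variable) over many runs of a program and count which address bits ever change. The sketch below only does the counting; collecting the samples is left out, and the function name is an assumption. The count is a crude stand-in for entropy: it says which bits were seen to vary, not how uniformly they vary.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the address bits that were observed to differ across n sampled
 * addresses (for example, the address of the same variable in n separate
 * runs). With few samples this undercounts; it is not a true entropy
 * estimate. (Hypothetical helper for illustration.) */
int varying_bits(const uint64_t *addrs, size_t n)
{
    uint64_t diff = 0;
    for (size_t i = 1; i < n; i++)
        diff |= addrs[0] ^ addrs[i];   /* accumulate every bit that changed */

    int bits = 0;
    while (diff != 0) {
        bits += (int)(diff & 1u);
        diff >>= 1;
    }
    return bits;
}
```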

A Gentoo x32 release candidate

Posted Jun 6, 2012 17:05 UTC (Wed) by gmaxwell (guest, #30048) [Link] (20 responses)

> Moreover, the programs that use tons of pointers where x32 would be a big savings aren't using much memory to begin with.

Gah. I meant to say that they're either not using much memory, in which case the savings doesn't matter— or they are and they're the kind of workload where 64 vs 32 has scaling implications. (e.g. the browser)

A Gentoo x32 release candidate

Posted Jun 6, 2012 17:31 UTC (Wed) by mikemol (guest, #83507) [Link] (13 responses)

I very intentionally didn't mention the browser as an application you'd want to be 32-bit. I thought about Chrome's model of one-process-per-tab, and decided I still liked the larger address for mmap and IPC purposes. The browser (or, at least, most of it) should be 64-bit. Perhaps there'd be sufficiently low overhead to have just the JS engine 32-bit.

Your browser is only one out of hundreds (if not thousands) of programs on your computer. Many (most?) of them only run for a few moments, or otherwise don't consume (or don't derive meaningful benefit from consuming) huge amounts of memory.

Take the 'dd' command. top. ls. bash. dash. cp. mv. echo. cat. tee. cupsd. dbus-daemon. lpr. grep. find. xargs.

The programs you spend hours every day staring at? Yeah, those probably benefit from having a 64-bit address space. The programs you don't think about, often when you're not even actively using them? They probably don't.

A Gentoo x32 release candidate

Posted Jun 6, 2012 17:39 UTC (Wed) by gmaxwell (guest, #30048) [Link] (9 responses)

Yes, and how much benefit is there from making dd, top, ls, bash, cp, mv, echo, cat, tee, etc. x32 instead of x86_64? They have (and should have) very few pointers. So there should be very little memory savings and very little CPU cycle reduction from memcpying smaller pointers. (And if not, those programs should be fixed; certainly it would be easier to fix them to not copy huge pointer arrays than it would be to fix the big tools not to need a lot of VM.)

But they do link shared libraries— at least libc— which is rather large. So if you're going to have a mix of x32 and x86_64 programs running you're going to end up with another copy of libc in memory for those things, passing through your caches, etc... which should easily offset the tiny gains from making those programs x32.

A Gentoo x32 release candidate

Posted Jun 6, 2012 18:13 UTC (Wed) by mikemol (guest, #83507) [Link]

Anything that uses linked-lists or tree data structures stands to benefit. And if you're dealing in dense packs of pointers in a data structure, you'll probably benefit from that fitting more tightly into a cache line.
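To put numbers on the cache-line point, here is a small sketch. It models a list node holding a "next" link and one payload pointer, using fixed-width integers in place of real pointers so both layouts can be compared on one machine. The struct names and the 64-byte line size are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* A list node with a "next" link and one payload pointer, modelled with
 * integers of the matching width instead of real pointers. */
struct node32 { uint32_t next; uint32_t payload; };  /* 4-byte pointers:  8 bytes */
struct node64 { uint64_t next; uint64_t payload; };  /* 8-byte pointers: 16 bytes */

/* How many whole nodes fit in one cache line of the given size. */
size_t nodes_per_line(size_t node_size, size_t line_size)
{
    return line_size / node_size;
}
```

With a 64-byte line, nodes_per_line(sizeof(struct node32), 64) gives 8 nodes per line, against 4 for struct node64.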

A Gentoo x32 release candidate

Posted Jun 6, 2012 18:47 UTC (Wed) by and (guest, #2883) [Link]

I don't want to hurt anyone's feelings, but I'm working on CFD simulation code. The problem I encounter on a daily basis is that these programs are _very_ clearly CPU-bound (read: they eat up all your CPU time and still use way below 1GBit per core). Thus I'm really enthusiastic to try x32. (Once it's available in a mainstream distribution, that is. I gave up on Gentoo a few years ago...)

A Gentoo x32 release candidate

Posted Jun 6, 2012 22:27 UTC (Wed) by butlerm (subscriber, #13312) [Link] (6 responses)

> which should easily offset the tiny gains from making those programs x32.

Those programs, yes. There is a significant class of other programs that can be sped up by as much as 40% compared to x86-64. The advantage is so great that x32 is reasonably likely to predominate over the latter in the future, outside a relatively narrow set of applications.

A Gentoo x32 release candidate

Posted Jun 6, 2012 23:28 UTC (Wed) by andrel (guest, #5166) [Link] (5 responses)

I'll bite -- what are the classes of programs for which x32 gets a 40% speedup over x86-64?

A Gentoo x32 release candidate

Posted Jun 7, 2012 0:34 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

pointer heavy programs where the smaller pointer size lets more data fit in the cpu cache instead of the app having to wait for the data to be read in from memory.

I don't know any specific programs, but there are people who have reported that using 32 bit apps on 64 bit systems results in better performance than using 64 bit apps.

This seldom applies on the AMD64 architecture as 64 bit mode also gives you twice as many registers to use, but on Sparc and Power* systems this is a very common situation.

x32 is creating an equivalent architecture for the AMD64 systems.

A Gentoo x32 release candidate

Posted Jun 7, 2012 9:06 UTC (Thu) by dvandeun (guest, #24273) [Link] (1 responses)

I develop an interpreter for a toy language in Haskell on an old i3 540 MacBook with 32 bit ghc. When I compile it on a development server at the university, with fast Xeons and lots of cache and RAM, and 64 bit ghc, it is not faster on a quicksort benchmark. (This is of course a double effect: Haskell code uses lots of pointers, and quicksort on linked lists uses lots of pointers. On other benchmarks of my interpreter, the 64 bit server does better than the MacBook, but not spectacularly better.)

A Gentoo x32 release candidate

Posted Jun 10, 2012 3:48 UTC (Sun) by vonbrand (subscriber, #4458) [Link]

... not to mention that quicksort (which is designed for arrays) makes next to no sense on lists...

A Gentoo x32 release candidate

Posted Jun 7, 2012 21:48 UTC (Thu) by paulj (subscriber, #341) [Link]

I've measured the v8 JavaScript JIT to be slightly faster with i686 than AMD64, on JavaScript benchmarks. I'd expect x32 to be slightly faster again. Anything where memory usage is dominated by pointer-rich data structures (e.g. complex indices over small units of data) will be faster with x32, if it doesn't need more than a 32-bit address space.

Also, as overall system memory usage is generally lower with x32, it allows, e.g., more VMs to be run for the same amount of memory.

A Gentoo x32 release candidate

Posted Jun 8, 2012 20:54 UTC (Fri) by butlerm (subscriber, #13312) [Link]

> I'll bite -- what are the classes of programs for which x32 gets a 40% speedup over x86-64?

The specific example I had in mind is 181.mcf, part of the SPEC 2000 CPU benchmark.

http://www.spec.org/cpu2000/CINT2000/181.mcf/docs/181.mcf...

I imagine that many Perl, Python, and Java programs will show comparable improvements, in addition to compilers, linkers, web browsers, xml processors, interpreters, x32 native kernels, and garbage collected languages in general.

With support for near and far pointers it is conceivable one could dramatically improve kernel performance as well, making an x32/x86-64 hybrid kernel perform nearly as well as an x32 native one, without losing the ability to support 64 bit applications.

A Gentoo x32 release candidate

Posted Jun 7, 2012 13:54 UTC (Thu) by foom (subscriber, #14868) [Link]

Chrome on Windows is only available as a 32bit binary. Chrome on Linux is likely only available as x86-64 because 32-bit libraries are not always readily available on x86-64 Linux distributions, so it was necessary.

Why do you think that Chrome on Linux would actually need the 64-bit address space when the vast majority of the installs (Windows) are all 32bit and work great?

A Gentoo x32 release candidate

Posted Jun 18, 2012 7:47 UTC (Mon) by massimiliano (subscriber, #3048) [Link]

> I very intentionally didn't mention the browser as an application you'd want to be 32-bit. I thought about Chrome's model of one-process-per-tab, and decided I still liked the larger address for mmap and IPC purposes. The browser (or, at least, most of it) should be 64-bit. Perhaps there'd be sufficiently low overhead to have just the JS engine 32-bit.

Well, for most of the world "Chrome" means "Chrome on Windows", and "Chrome on Windows" means "the 32bit Chrome build".

And since Chrome works pretty well on Windows I guess a 32bit build should work well also on our beloved Linux desktops...

In fact here (V8 development team) we work on 64bit Linux hosts but we test and develop 32bit x86 before anything else, and then make sure that also amd64 and arm work perfectly. But when we look at performance numbers we do it mainly on the 32bit builds.

A Gentoo x32 release candidate

Posted Jun 18, 2012 11:29 UTC (Mon) by hummassa (subscriber, #307) [Link]

The browser works by storing the DOM in a data structure that is crowded with pointers. Add to that the fact that if you have one sandbox with over 3GB of data you are pretty much in the insane corner, and I would guesstimate that Chrome/Chromium would benefit deeply from being 32bit.

A Gentoo x32 release candidate

Posted Jun 6, 2012 20:23 UTC (Wed) by jpnp (guest, #63341) [Link] (5 responses)

The issue is not the 4GB RAM limit, but the precious few MB of cache. I have data structure (pointer) heavy code which has moderate memory requirements but requires a lot of manipulation. Smaller pointers equal better cache locality.

I'm confident this code would benefit, as it already benchmarks better running as 32bit code on AMD64; adding the extra registers from the 64bit ABI can only aid the compiler.

Mind you, I don't see a great need for the whole OS to be X32, just support for X32 applications running on AMD64 for those workloads where it helps.

A Gentoo x32 release candidate

Posted Jun 7, 2012 0:43 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (4 responses)

> The issue is not the 4GB RAM limit, but the precious few MB of cache. I have data structure (pointer) heavy code which has moderate memory requirements but requires a lot of manipulation. Smaller pointers equal better cache locality.

You've probably already considered this, but for workloads like this, why not pre-allocate a moderate-sized pool of memory for this data and store just the offsets? That seems like a less intrusive solution than requiring multiple copies of system libraries to support amd64 and x32 side-by-side.
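The pool-and-offsets idea can be sketched as follows: nodes live in one pre-allocated array and link to each other by 32-bit index rather than by pointer, so each link costs 4 bytes even in a 64-bit process. All names here (pool, list_push, NIL and so on) are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

#define NIL UINT32_MAX                 /* index value meaning "no node" */

struct node { uint32_t next; int32_t value; };   /* 8 bytes per node */

struct pool {
    struct node *nodes;                /* one pre-allocated block */
    uint32_t used;                     /* nodes handed out so far */
    uint32_t capacity;
};

/* Allocate room for `capacity` nodes; returns 1 on success, 0 on failure. */
int pool_init(struct pool *p, uint32_t capacity)
{
    p->nodes = malloc((size_t)capacity * sizeof *p->nodes);
    p->used = 0;
    p->capacity = (p->nodes != NULL) ? capacity : 0;
    return p->nodes != NULL;
}

/* Push a value onto the front of the list whose head index is *head.
 * Returns 0 (leaving the list unchanged) when the pool is full. */
int list_push(struct pool *p, uint32_t *head, int32_t value)
{
    if (p->used == p->capacity)
        return 0;
    uint32_t idx = p->used++;
    p->nodes[idx].next = *head;
    p->nodes[idx].value = value;
    *head = idx;
    return 1;
}

/* Walk the list by index and add up the stored values. */
int64_t list_sum(const struct pool *p, uint32_t head)
{
    int64_t sum = 0;
    for (uint32_t i = head; i != NIL; i = p->nodes[i].next)
        sum += p->nodes[i].value;
    return sum;
}

void pool_free(struct pool *p)
{
    free(p->nodes);
    p->nodes = NULL;
    p->used = p->capacity = 0;
}
```

The cost is visible in list_sum: every access goes through p->nodes[i] instead of a plain dereference, and the code no longer reads like ordinary pointer code.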

Also, is it too much to ask that x32 applications be capable of interacting with amd64 libraries? Perhaps merge x32 and x86_64 into a single ABI with "near" and "far" pointers? If mixed code always limits itself to a 32-bit address space, and x32 code uses the 64-bit system call ABI, then it should be possible to convert between "near" and "far" pointers transparently and use a single set of libraries for both modes. The only remaining issue that I can see is making sure the compiler knows which pointers need to be "far" pointers even when compiled in a x32 context (e.g. shared library header files).

A Gentoo x32 release candidate

Posted Jun 7, 2012 2:51 UTC (Thu) by butlerm (subscriber, #13312) [Link] (3 responses)

> You've probably already considered this, but for workloads like this, why not pre-allocate a moderate-sized pool of memory for this data and store just the offsets?

You can recompile well written programs for an ABI like this without any source code changes. Manually adding offsets, on the other hand, is slower and makes for unusually ugly looking code.

> Also, is it too much to ask that x32 applications be capable of interacting with amd64 libraries?

It is conceivable that shims could be provided for some 64-bit libraries, but in the general case (C++ libraries for example) it is not even practical.

Most initial x32 systems are likely to be x32 only. I wouldn't expect a desktop distribution to come with full libraries for both x32 and x86-64, one would probably either have x32 releases that come with a handful of 64 bit packages, or x86-64 releases that come with a handful of x32 packages.

A Gentoo x32 release candidate

Posted Jun 7, 2012 3:57 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (2 responses)

> It is conceivable that shims could be provided for some 64-bit libraries, but in the general case (C++ libraries for example) it is not even practical.

I wasn't actually talking about providing shims. Rather, shared libraries would be compiled just as they are now in amd64 mode. The x32 programs would use 64-bit pointers in shared data structures and APIs, and 32-bit pointers in their own internal structures and APIs. Obviously, for this to work either the x32 parts or the dual-mode parts have to be marked somehow, e.g. with an attribute or a pragma line, so the compiler knows to use the larger pointers when compiling shared APIs for x32. Since any application with x32 components is guaranteed to run in a 32-bit address space, converting between the 64-bit and 32-bit pointers is trivial--the most significant 32 bits of the full-size pointers are always zero. Apart from marking the boundaries, the compiler can do all of the work.
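The value conversion described here is simple enough to sketch, with pointers modelled as plain integers and hypothetical type names. It covers only single pointer values; structures that contain pointers would need their layout converted too, which this does not attempt.

```c
#include <stdint.h>

typedef uint32_t near_ptr;   /* 32-bit pointer as used inside x32 code (illustrative) */
typedef uint64_t far_ptr;    /* 64-bit pointer as used by amd64 code (illustrative) */

/* Widening is a zero extension: the upper 32 bits become zero. */
far_ptr near_to_far(near_ptr p)
{
    return (far_ptr)p;
}

/* Narrowing is only valid when the upper 32 bits are already zero.
 * Returns 1 and stores the result on success, 0 if the address does
 * not fit in 32 bits. */
int far_to_near(far_ptr p, near_ptr *out)
{
    if ((p >> 32) != 0)
        return 0;
    *out = (near_ptr)p;
    return 1;
}
```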

> I wouldn't expect a desktop distribution to come with full libraries for both x32 and x86-64, one would probably either have x32 releases that come with a handful of 64 bit packages, or x86-64 releases that come with a handful of x32 packages.

The problem is the dependencies. Add just one moderately complex "foreign" package and you may end up needing duplicates of most of the system libraries. Some packages are relatively standalone, but what if you wanted, say, an x32 build of Chromium on an amd64 system? You'd need x32 builds of around 133 other packages[1] just to provide that one application.

[1] Estimated with: apt-cache depends --recurse -i chromium|awk '/^\s*Depends:\s+lib/{print $2;}'|sort -u

A Gentoo x32 release candidate

Posted Jun 7, 2012 4:15 UTC (Thu) by mikemol (guest, #83507) [Link]

The x32 ABI is, in part, a redefinition of how the C and C++ languages operate on x86-64. You're telling the compiler that hey, pointers and 'long' are 32-bit.
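A quick way to see which data model a given build uses is to look at the type sizes and at the compiler's predefined macros. This sketch relies on GCC defining both `__x86_64__` and `__ILP32__` when targeting x32; the function names are made up.

```c
#include <stddef.h>

/* Name the ABI this translation unit was compiled for, based on the
 * macros GCC predefines. Other compilers may need other checks. */
const char *abi_name(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";       /* 64-bit instruction set, 32-bit pointers and long */
#elif defined(__x86_64__)
    return "x86-64";    /* 64-bit pointers and long */
#elif defined(__i386__)
    return "i386";      /* classic 32-bit x86 */
#else
    return "other";
#endif
}

/* Sizes that differ between the x32 and x86-64 ABIs. */
size_t pointer_bytes(void) { return sizeof(void *); }
size_t long_bytes(void)    { return sizeof(long); }
```

Built as x86-64, pointer_bytes() and long_bytes() both return 8; built for x32 or i386 they both return 4, which is why headers shared between the two ABIs describe different binary layouts.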

You're *not* going to be able to interlink x32 and x86-64 binaries while sharing headers unless you make those headers aware of the differing binary representations of the types...and if you do that, you're making things significantly more complicated over a broad cross-section of code. That means tons of bugs.

As for having per-arch copies of the same binaries...that's already status quo on multilib systems. Not that big of a problem, really. x32 is poised to replace the old 32-bit ABI, with its segmented memory model and relatively limited register and CPU instruction set, with a 32-bit ABI with more registers and a higher-level guaranteed minimum for CPU instruction set availability. x32, in a sense, represents the new "i686" minimum compiler target for x86 systems with a 32-bit ABI.

A Gentoo x32 release candidate

Posted Jun 8, 2012 7:34 UTC (Fri) by khim (subscriber, #9252) [Link]

> Apart from marking the boundaries, the compiler can do all of the work.

Nope. Think about the standard library. memcpy quite obviously does not need to convert pointers, but aio_read needs to do that. And if you pass structures with pointers to functions around then it becomes real ugly real fast.

x86-64 NaCl is an independent reimplementation of the x32 architecture (we plan to rebase our changes on top of x32 when it's stable), and for initial benchmarks we used the standard x86-64 glibc linked with our x32-like binary. This was a disaster: it was possible to compile and run a few of the simpler SPEC CPU2000 benchmarks this way, but things like 253.perlbmk just refused to work properly.

When we finally got the loader and libc ported, we dropped this mixed mode like a hot potato. It's not worth it, believe me.

> Some packages are relatively standalone, but what if you wanted, say, an x32 build of Chromium on an amd64 system? You'd need x32 builds of around 133 other packages[1] just to provide that one application.

Right. This is a lot of work. But it's still simpler than trying to stitch Chromium together from x32 pieces and x86-64 pieces.

A Gentoo x32 release candidate

Posted Jun 6, 2012 15:54 UTC (Wed) by realnc (guest, #60393) [Link] (9 responses)

Yes, I use Firefox with 10GB web pages and write 6GB big emails too.

A Gentoo x32 release candidate

Posted Jun 6, 2012 16:20 UTC (Wed) by moltonel (subscriber, #45207) [Link] (8 responses)

The vast majority of your programs need less than 4GB, but for those that need more (databases, games, etc.) you should be able to compile them with the amd64 ABI and run them alongside the rest.

I hope that the Gentoo x32 arch provides that, but given that it has a lone /lib folder, I'm not sure. Does somebody know?

As for the weaker ASLR, it's the old speed-security tradeoff. But ASLR on 32 bits isn't useless either. And you could compile "only the security-sensitive programs" in amd64.

A Gentoo x32 release candidate

Posted Jun 6, 2012 16:43 UTC (Wed) by marduk (subscriber, #3831) [Link] (7 responses)

Gentoo doesn't have a lone lib directory.

For the "traditional" amd64 multilib profiles, Gentoo has "lib32" and "lib64" directories, where "lib" is a symlink to lib64. For x32 profiles, multilib goes away and you have "lib", "libx32" and "lib64" directories.

Now, at least *now*, it appears you can switch profiles in order to build packages for whatever arch. E.g. if I want to build "true" 64-bit sqlite then I can "eselect profile" to amd64. I don't know if there will be the possibility to build the same package as both x32 and 64-bit... you'd probably need a separate chroot (like we do for building 32-bit packages on amd64) or similar.

Or maybe they will release some emul-* packages the way they do for some 32bit libs on amd64...

Personally I think the use case for this will be fairly small.

A Gentoo x32 release candidate

Posted Jun 6, 2012 17:19 UTC (Wed) by ilmari (guest, #14175) [Link] (4 responses)

> For the "traditional" amd64 multilib profiles, Gentoo has "lib32" and "lib64" directories, where "lib" is a symlink to lib64. For x32 profiles, multilib goes away and you have "lib", "libx32" and "lib64" directories.
What? Why on earth are they not doing proper /lib/<triplet> multiarch if they're changing things around anyway?

A Gentoo x32 release candidate

Posted Jun 6, 2012 19:06 UTC (Wed) by jengelh (guest, #33263) [Link] (3 responses)

You know the saying: That which is not understood is reinvented, and poorly so.

A Gentoo x32 release candidate

Posted Jun 6, 2012 19:44 UTC (Wed) by realnc (guest, #60393) [Link]

> You know the saying: That which is not understood is reinvented, and
> poorly so.

You could ask them if you're really interested instead of being an ass.

A Gentoo x32 release candidate

Posted Jun 7, 2012 4:16 UTC (Thu) by dirtyepic (guest, #30178) [Link] (1 responses)

Well, I think you're confusing multilib and multiarch, which we do both of, but the non-standard way Gentoo treats multilib is mainly the result of doing multilib before it became standard. When binary distros went one way, Gentoo went another. This caused its share of trouble in the past (much less so these days) and efforts have been made to narrow that divide, but this can be a very difficult change to make when you consider that Gentoo is a rolling distro and does not have the benefit of standard releases where this kind of major overhaul can be done. Instead it has to be carefully phased in a bit at a time, which is what has been going on for the last couple years.

Some of the things we do may seem strange and needlessly different to those outside our little bubble (and I'm sure some of them actually are). But before you dismiss us as ignorant, please take a minute to consider that the challenges faced in designing and maintaining a source-based rolling-release metadistribution are very different from those of a conventional binary distro. The solutions we find to overcome these challenges are going to look very different as well.

A Gentoo x32 release candidate

Posted Jun 11, 2012 23:12 UTC (Mon) by BenHutchings (subscriber, #37955) [Link]

Debian has a rolling distribution, called 'sid', and a mostly-rolling distribution, called 'testing'. :-) But we manage to do these sorts of transitions, although as with Gentoo they sometimes have to be phased over a long time.

A Gentoo x32 release candidate

Posted Jun 7, 2012 15:01 UTC (Thu) by moltonel (subscriber, #45207) [Link] (1 responses)

> Gentoo doesn't have a lone lib directory.
> For the "traditional" amd64 multilib profiles, Gentoo has "lib32" and "lib64"
> directories, where "lib" is a symlink to lib64. For x32 profiles, multilib goes
> away and you have "lib", "libx32" and "lib64" directories.

I know the current gentoo amd64 has multiple /lib (I have been using gentoo multilib for at least 6 years), and would be happy to hear that this new arch will use something similar, but the announcement says:

> the x32 ABI is the default one, and includes x86/amd64 ABIs. it is not using
> /lib32/ (and /lib is not a symlink) like our existing amd64 multilib as that
> is being phased out, and the x32 port allows me to do a clean break.

and the tarball indeed has one single "/lib". No "/lib{,/}{x32,amd64}" in sight.

So I'll rephrase:
* The current gentoo x32 prerelease doesn't seem to handle multi{lib,arch}. Is that correct?
* If so, is multilib support planned in the final release?

A Gentoo x32 release candidate

Posted Jun 7, 2012 18:04 UTC (Thu) by tetromino (guest, #33846) [Link]

> and the tarball indeed has one single "/lib". No "/lib{,/}{x32,amd64}" in sight.

You and I must be looking at different tarballs. In my copy of stage3-amd64-x32-20120605.tar.xz, I see /usr/libx32, /usr/lib64, and /usr/lib.

> the current gentoo x32 prerelease doesn't seem to handle multi{lib,arch}. Is that correct ?

That is not correct; Gentoo's default x32 profile is multilib. Take a look at your /usr/portage/profiles/arch/amd64/x32/make.defaults file:

DEFAULT_ABI="x32"
ABI="x32"
MULTILIB_ABIS="amd64 x86 x32"

FEATURES="collision-protect multilib-strict"

SYMLINK_LIB="no"

A Gentoo x32 release candidate

Posted Jun 6, 2012 17:23 UTC (Wed) by chithanh (guest, #52801) [Link]

> Because my life was not complete without being limited to 32 bits of VM space and the resulting loss of program scalability and ASLR effectiveness!

On Gentoo (Hardened), the 32 bit address space is less of a security concern than with other distros:
http://labs.mwrinfosecurity.com/blog/2010/09/02/assessing...

A Gentoo x32 release candidate

Posted Jun 6, 2012 18:07 UTC (Wed) by jzbiciak (guest, #5246) [Link] (18 responses)

*sigh* If you don't like x32, then don't use it. Others (including myself) think it's interesting. Heck, even Donald Knuth will likely find this interesting. (Scroll down to "A Flame About 64-bit Pointers".)

Do you also troll vi/emacs releases and Firefox/Chrome releases?

A Gentoo x32 release candidate

Posted Jun 6, 2012 19:29 UTC (Wed) by gmaxwell (guest, #30048) [Link] (15 responses)

I just responded correcting the inaccurate claim that it gives you the best of both (inaccurate because a major advantage of 64 bit is the increased memory space, and because multilib bloat will likely wash out the advantage of x32 unless you are x32 only).

This point is of interest to me because, if the advantages are misunderstood, distributions which offer less configuration may adopt it in their default x86_64 configurations and send us back to the bad old days of 4gb limits.

I'm happy to see Gentoo offering it, as it may be pretty interesting in embedded devices. I apologize for the bit of hijacking here.

A Gentoo x32 release candidate

Posted Jun 7, 2012 10:19 UTC (Thu) by teknohog (guest, #70891) [Link] (6 responses)

> if the advantages are misunderstood distributions which offer less configuration may adopt it in their default x86_64 configurations and send us back to the bad old days of 4gb limits.

I agree that x32 has its technical benefits, but it is a nightmare for the consumer/marketing side.

For years, people have been educated on the advantages of 64-bit systems, and now that we are mostly done, we want to confuse things again with yet another ABI. Which is apparently worse at only half the bits. Also, it will take time for closed software vendors to release x32 versions. When they catch up, people will have fun choosing from 3 different binaries.

On the technical side, x32 feels like overoptimization. We already trade some performance for overall convenience, for example by using higher-level languages and libraries. x86-64 already handles everything I currently do with computers, it would be weird to go back to something that does "almost everything", plus extra libraries for the rest.

As for the 4 GB limit per process, there is probably some quote involving 640 KB.

A Gentoo x32 release candidate

Posted Jun 7, 2012 20:08 UTC (Thu) by slashdot (guest, #22014) [Link] (5 responses)

x32 is just a faster version of x86, and so doesn't add much complexity.

32-bit-only x86 CPUs have already been very rare for a while, so x86 on Linux will hopefully die soon after x32 is released, resulting in a x32+x64 world instead of x86+x64, now with a single kernel architecture.

x86 userland will probably survive forever on Windows though, unless Microsoft decides to introduce x32 as well.

A Gentoo x32 release candidate

Posted Jun 11, 2012 13:10 UTC (Mon) by nix (subscriber, #2304) [Link] (4 responses)

> 32-bit-only x86 CPUs have already been very rare for a while

Uh, Atom?

A Gentoo x32 release candidate

Posted Jun 11, 2012 14:42 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

New Atoms support x86_64.

A Gentoo x32 release candidate

Posted Jun 11, 2012 22:01 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

Excellent! Maybe we *can* get rid of x86 then. I was resigned to its being immortal...

A Gentoo x32 release candidate

Posted Jun 11, 2012 22:05 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, there's NaCl remaining which uses x86 code.

Besides, amd64 is not much better than x86.

A Gentoo x32 release candidate

Posted Jun 12, 2012 22:45 UTC (Tue) by BenHutchings (subscriber, #37955) [Link]

The desktop Atom models are all 64-bit capable now, but there are still mobile and embedded models available that use older 32-bit cores.

A Gentoo x32 release candidate

Posted Jun 8, 2012 10:16 UTC (Fri) by roblucid (guest, #48964) [Link] (7 responses)

"I just responded correcting the inaccurate claim that it gives you the best of both (inaccurate because the a major advantage of 64 bit is the increased memory space, and because multilib bloat will likely washes the advantage of x32 unless you are x32 only)."

Actually, IMO the increased registers, better DMA, a higher CPU feature set baseline (e.g. SSE2) and higher resolution timers are the major advantages of AMD64. But back in the 90's, I told a Sun sales droid that 64 bit suffered pointer bloat (a typical machine back then had 16-64MiB RAM).

Even today none of the daily desktop applications I run are using >3 GiB RAM, so having an x86_64 kernel with an x32 userland is a practicable optimisation. Compilers or RDBMS servers, which actually use >3 GiB, tend to run on atypical boxen; if you can run those without worrying about the extra x86_64 page faults, then the "bloat" due to duplicated system libraries is not a significant issue.

What's ironic about x32 is that it's come so late; a 64 bit kernel with 32 bit userland would have been a good transitional step. Now RAM and extra cores are so cheap that even low spec machines have a minimum of 3GiB and dual cores, so most people won't noticeably benefit.

A Gentoo x32 release candidate

Posted Jun 8, 2012 17:03 UTC (Fri) by teknohog (guest, #70891) [Link] (6 responses)

> What's ironic about x32 is that it's come so late; a 64 bit kernel with 32 bit userland would have been a good transitional step. Now RAM and extra cores are so cheap that even low spec machines have a minimum of 3GiB and dual cores, so most people won't noticeably benefit.

True. It is easy to refer to other architectures like MIPS and Power that had an x32-like setup many years ago, but it was also a time of less capable hardware, so it was a more sensible optimization.

A Gentoo x32 release candidate

Posted Jun 8, 2012 17:17 UTC (Fri) by jzbiciak (guest, #5246) [Link]

That's very true.

That said, L1D caches haven't gotten larger in that intervening time frame, and I'd suggest their impact on performance is still rather noticeable. L1Ds still seem to hover between 16K (on Zambezi, for example) and 64K (previous K10s).

A Gentoo x32 release candidate

Posted Jun 8, 2012 17:46 UTC (Fri) by jzbiciak (guest, #5246) [Link] (4 responses)

For fun, I put together this admittedly very contrived benchmark just to compare the difference of 4 and 8 byte pointer sizes in perhaps the worst possible L1D thrashing scenario. Please don't laugh too much at my code. I wrote it in a hurry in the last 10 minutes.

What this code does, in short, is construct a scrambled linked list of structs, each containing simply a "next" pointer and a pointer to char. I step through the scrambled list incrementing the pointer to char on each element. (I marked the pointer itself as volatile so it wouldn't be dead-coded.)
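For readers without the original source at hand, here is a rough reconstruction of that description. It is not the code that produced the numbers below; the names, the shuffle and the `base` buffer parameter are all assumptions. The caller passes a buffer with at least as many bytes as the number of full passes it intends to run, so the incremented pointers stay in bounds.

```c
#include <stdlib.h>

struct node {
    struct node *next;
    char *volatile datum;   /* the pointer itself is volatile, so the
                               increments cannot be optimized away */
};

/* Allocate n nodes and link them into a single cycle in scrambled order.
 * Every datum starts at `base`. Returns NULL if n is 0 or allocation fails;
 * the caller frees the returned array. */
struct node *build_scrambled(size_t n, unsigned seed, char *base)
{
    if (n == 0)
        return NULL;
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *order = malloc(n * sizeof *order);
    if (nodes == NULL || order == NULL) {
        free(nodes);
        free(order);
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        order[i] = i;
        nodes[i].datum = base;
    }
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)             /* link in shuffled order */
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];
    free(order);
    return nodes;
}

/* Follow `steps` links from `start`, incrementing each node's char pointer
 * on the way; returns the node where the walk ends. */
struct node *walk(struct node *start, size_t steps)
{
    struct node *p = start;
    for (size_t i = 0; i < steps; i++) {
        p->datum++;
        p = p->next;
    }
    return p;
}
```

Timing repeated calls to walk() over a list somewhat larger than the L1 data cache, once built with -m32 and once as native 64-bit, gives the kind of comparison described here.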

I compiled the code as native 64-bit and as x86 (not x32), and let it run 10 trials of the benchmark loop each. (gcc -O3 -fomit-frame-pointer in both cases; Only difference is that I used -m32 for the 32-bit version.)

Here's the results. I'll let you guess which column is 32 bit and which one is 64 bit.

  1688.600ms      2606.294ms
  1671.547ms      2561.276ms
  1670.577ms      2626.574ms
  1668.617ms      2599.231ms
  1621.522ms      2193.314ms
  1573.468ms      2108.417ms
  1669.220ms      2266.626ms
  1668.592ms      2507.869ms
  1624.254ms      2195.270ms
  1675.467ms      2611.658ms

Now, I didn't try x32 (I don't have that set up anywhere yet), but I wouldn't expect this simple benchmark to show any benefit for x32 over x86 given its rather narrow scope. The main point was to highlight that L1D cache pollution due to bloated pointers can also be a noticeable factor in some programs.

Admittedly, my focused benchmark probably overstates the effect relative to the vast majority of programs, but I thought the data might be interesting nonetheless.

A Gentoo x32 release candidate

Posted Jun 8, 2012 20:27 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (1 responses)

Considering that you doubled the sizes of your nodes, I'm a bit surprised that the 64-bit version only took 67% longer (going by maximum 64-bit time vs. minimum 32-bit time). If that's the worst case, perhaps the performance benefits of 32-bit pointers really are somewhat exaggerated, at least for programs which aren't dealing exclusively with pointers.
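For anyone who wants to check the 67% figure, here is a rough Python sketch of the arithmetic. It assumes the faster (left) column in the posted timings is the 32-bit build, which the original post leaves as a guess; the helper name `slowdown` is made up for illustration.

```python
# Timings in ms, copied from the two columns posted above.
# Assumption: the faster (left) column is the 32-bit build.
t32 = [1688.600, 1671.547, 1670.577, 1668.617, 1621.522,
       1573.468, 1669.220, 1668.592, 1624.254, 1675.467]
t64 = [2606.294, 2561.276, 2626.574, 2599.231, 2193.314,
       2108.417, 2266.626, 2507.869, 2195.270, 2611.658]

def slowdown(slow, fast):
    """Fractional extra time: 0.5 means 50% longer."""
    return slow / fast - 1.0

# Worst case as described: maximum 64-bit time vs. minimum 32-bit time.
worst_case = slowdown(max(t64), min(t32))
# A less pessimistic view: mean vs. mean.
mean_case = slowdown(sum(t64) / len(t64), sum(t32) / len(t32))

print(f"max 64-bit vs min 32-bit: {worst_case:.1%} longer")
print(f"mean vs mean:             {mean_case:.1%} longer")
```

Comparing means instead of extremes gives a noticeably smaller gap (somewhere under 50%), which only strengthens the point that doubling the node size did not double the run time.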

A Gentoo x32 release candidate

Posted Jun 8, 2012 21:31 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Interestingly, if I make the array larger to stress the L2 cache, the difference gets smaller (14s vs 16s for 100 iterations with (1<<21) nodes). I imagine that's due to some of the following facts:

  • The r-m-w on the datum is guaranteed to hit L1 after ->next gets brought in, regardless of pointer size, which means this operation is equal cost for 32-bit and 64-bit and can largely be ignored, save for the victim writebacks it generates.
  • The subsequent loop iteration (i.e. accessing the next structure in the list) is much, much more likely to miss L1D regardless of pointer size with the larger data set, and somewhat more likely to miss L2. This tends to equalize 32-bit and 64-bit performance. (See analysis below).
  • The 64-bit version shows less relative bandwidth amplification due to cache writebacks than the 32-bit version. That is, with my CPU's 64-byte linesize, an r-m-w on an 8 byte structure could generate a 64-byte writeback (8x amplification), whereas the relative ratio for a 16 byte structure is half that. A different way of thinking about it is that the total number of bytes written due to cache writebacks for both versions should be fairly similar if they have similar hit rates. Their hit rates converge as the dataset grows beyond the cache size.
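The writeback amplification in the last bullet is just a ratio, sketched here in Python. The 64-byte line size and the 8 and 16 byte node sizes come from the text; the function name is hypothetical, and the model simply assumes one dirty node causes one full-line writeback.

```python
LINE_SIZE = 64  # bytes per cache line, as stated for the CPU above

def writeback_amplification(node_bytes, line_bytes=LINE_SIZE):
    """Bytes written back per byte of node touched, assuming a single
    dirty node forces the whole line to be written back."""
    return line_bytes / node_bytes

amp32 = writeback_amplification(8)   # node = two 4-byte pointers
amp64 = writeback_amplification(16)  # node = two 8-byte pointers

print(f"32-bit nodes: {amp32:.0f}x, 64-bit nodes: {amp64:.0f}x")
```

Under that assumption, both versions write back one line per miss, so the absolute writeback traffic is similar even though the 32-bit version's relative amplification is twice as high.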

Anyway, we see folks tilt at much shorter windmills than 67% all the time. :-) A 5% to 10% speedup might be interesting to some, especially if it translates to something like increased battery life. A 67% speedup in a key bit of code would be huge for some, but that may indeed be near the peak difference you might expect. In the end, I guess it'll be determined by benchmarking, one hopes.


Some more analysis on the bullets above: Let's just consider the L1D cache, and assume everything hits L2. If we also assume the dataset is large enough that every dereference misses L1D, then the steady-state cost of each p->data++; p = p->next amounts to an L1D linefill plus a victim writeback from L1D to L2 for the replaced line. In that case, the cost for 32-bit and 64-bit versions should be identical, since every dereference incurs a miss and a victim writeback of the same amount of data.

To see a difference between the 32-bit and 64-bit versions, therefore, you need to take the hit-rate into account. Let's suppose the 32-bit version fits perfectly in L1D, but the 64-bit version (because it's twice the size) only fits halfway. Now none of the 32-bit requests miss, but half of the 64-bit requests do. The 32-bit version incurs no L1D miss penalty and no victim writeback penalty, while the 64-bit version, on average, incurs both on half of its dereferences.

If we continue to reduce the size, eventually both versions fit entirely again, and are once again on an even footing. This suggests that at the endpoints of the curve (all accesses hit and all accesses miss), the two perform more or less equivalently, at least for this benchmark. Through the transition band, though, the 64-bit version starts degrading sooner, and the 32-bit version asymptotically approaches its performance in the long tail.

The hit-rate expressions (expressing the hit rate for dereferencing *p->next) for both, assuming no pathological cache behavior and a good random ordering of list nodes and a dataset larger than L1D, should be something along the lines of: hit_rate = size_of_L1D / total_dataset. Now, this implies the hit rate will always be double for the smaller pointer size, because total_dataset would be half the size.
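A quick Python sketch of that hit-rate expression, to make the "always double" claim concrete. The 32K L1D and the node count are assumptions picked for illustration, not values from the benchmark; the clamp to 1.0 covers the case where the dataset fits in L1D, which the expression above excludes.

```python
def hit_rate(l1d_bytes, num_nodes, node_bytes):
    """hit_rate = size_of_L1D / total_dataset, clamped to 1.0 when the
    whole dataset fits. Assumes random node order and no pathological
    cache behavior, as in the text."""
    total_dataset = num_nodes * node_bytes
    return min(1.0, l1d_bytes / total_dataset)

L1D = 32 * 1024   # assumed L1D size in bytes
NODES = 1 << 14   # hypothetical node count

h32 = hit_rate(L1D, NODES, 8)    # 8-byte nodes (32-bit pointers)
h64 = hit_rate(L1D, NODES, 16)   # 16-byte nodes (64-bit pointers)

print(f"32-bit: {h32:.3f}, 64-bit: {h64:.3f}, ratio: {h32 / h64:.1f}")
```

The factor of two only holds while both datasets are larger than L1D; once the 32-bit version fits, its hit rate saturates at 1.0 and the ratio shrinks.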

But, the performance will not double if the miss rates are high, because misses are expensive. If we say that the cost of a hit is k1 and the cost of a miss is k2, then the total cost will be (k1 * hit_rate + k2 * (1 - hit_rate)). Suppose for sake of argument that k2 = 10 * k1 and our hit rate is only 10% for 32-bit pointers and 5% for 64-bit pointers. (This ratio of k1 to k2 is fairly reasonable to a first order for modern architectures.) For 32-bit, the cost would be (1 * 10% + 10 * 90%) = 9.1. For 64-bit pointers, the cost would be (1 * 5% + 10 * 95%) = 9.55. You can see how they'd asymptotically approach, since the cost of the misses dominates any gains made by the hits, and doubling the hits does not halve the number of misses.

The picture is quite a bit better for 32-bit if the hit rates are higher though. Suppose the hit rate was 90% for 32-bit pointers and only 45% for 64-bit pointers. Now you have (1 * 90% + 10 * 10%) = 1.9 vs (1 * 45% + 10 * 55%) = 5.95.
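The two scenarios above can be reproduced with a few lines of Python. This is only the toy cost model from the text, with k1 = 1 and k2 = 10 as the assumed hit and miss costs; the function name is made up.

```python
def average_cost(hit_rate, k1=1.0, k2=10.0):
    """Average cost per dereference: k1 * hit_rate + k2 * (1 - hit_rate)."""
    return k1 * hit_rate + k2 * (1.0 - hit_rate)

# (32-bit hit rate, 64-bit hit rate) for the two cases discussed above.
scenarios = {"low hit rate": (0.10, 0.05), "high hit rate": (0.90, 0.45)}

for name, (h32, h64) in scenarios.items():
    c32, c64 = average_cost(h32), average_cost(h64)
    print(f"{name}: 32-bit {c32:.2f}, 64-bit {c64:.2f}, "
          f"64-bit is {c64 / c32:.2f}x the cost")
```

In the low-hit-rate case the 64-bit version costs only about 5% more, while in the high-hit-rate case it costs roughly three times as much, which is the sensitivity to cache fit described here.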

Maybe if I get bored later, I could modify my program to collect a sweep of such datapoints. It might be enlightening.

It certainly suggests that 64-bit pointers aren't automatic death for performance. It also suggests that the gains 32-bit pointers might show are rather sensitive to how well your application fits in the cache to begin with, and how far the increased pointer size pushes you from "fitting" toward "not fitting". If you can tune your application to work on subproblems, it may be that you can tune both 32-bit and 64-bit variants to achieve nearly identical performance if you can make both utilize L1 effectively.

A Gentoo x32 release candidate

Posted Jun 8, 2012 21:35 UTC (Fri) by ABCD (subscriber, #53650) [Link]

Using the same benchmark on this system which *does* have x32, I get the following results:
     -m32          -m64         -mx32
  2283.403ms    3339.631ms    2282.777ms
  2278.988ms    3250.245ms    2283.710ms
  2284.797ms    3437.402ms    2285.109ms
  2295.849ms    3344.579ms    2282.430ms
  2247.007ms    2988.275ms    2227.092ms
  2189.324ms    2872.535ms    2178.817ms
  2309.024ms    3118.024ms    2278.871ms
  2341.720ms    3140.920ms    2287.304ms
  2229.621ms    2999.011ms    2207.783ms
  2295.220ms    3435.611ms    2291.899ms

A Gentoo x32 release candidate

Posted Jun 9, 2012 1:30 UTC (Sat) by vapier (guest, #15768) [Link]

the Gentoo stage3 can be d/l-ed and chrooted into. all you need is a host x86_64 kernel with x32 enabled in it. then when you're done, `rm -rf` it and you're free of Gentoo again.

A Gentoo x32 release candidate

Posted Jun 10, 2012 11:25 UTC (Sun) by man_ls (guest, #15091) [Link] (1 responses)

Perhaps Knuth always knows beforehand how much memory any given program will consume. In this world where phones have 1GB of RAM I am not sure it is always feasible, or practical.

A Gentoo x32 release candidate

Posted Jun 10, 2012 15:59 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

>Perhaps Knuth always knows beforehand how much memory any given program will consume.

ISTR that TeX allocates all memory ahead of time and then runs in constant memory from there on out, so there may be some truth to this :) .

A Gentoo x32 release candidate

Posted Jun 6, 2012 20:25 UTC (Wed) by iabervon (subscriber, #722) [Link]

The machine I'm sitting at only has 4G of RAM total, and can't deal with as much data now (with an x86_64 userspace) as it could when it was 32-bit. It's nice for web browsers that they can mmap as much as they want, but I'd like to be able to store more than half a billion words in anonymous memory on this machine.

