
The x32 system call ABI

By Jonathan Corbet
August 29, 2011
The 32-bit x86 architecture has a number of well-known shortcomings. Many of these were addressed when this architecture was extended to 64 bits by AMD, but running in 64-bit mode is not without problems either. For this reason, a group of GCC, kernel, and library developers has been working on a new machine model known as the "x32 ABI." This ABI is getting close to ready, but, as a recent discussion shows, wider exposure of x32 is bringing some new issues to the surface.

Classic 32-bit x86 has easily-understood problems: it can only address 4GB of memory and its tiny set of registers slows things considerably. Running a current processor in the 64-bit mode fixes both of those problems nicely, but at a cost: expanding variables and pointers to 64 bits leads to expanded memory use and a larger cache footprint. It's also not uncommon (still) to find programs that simply do not work properly on a 64-bit system. Most programs do not actually need 64-bit variables or the ability to address massive amounts of memory; for that code, the larger data types are a cost without an associated benefit. It would be really nice if those programs could take advantage of the 64-bit architecture's additional registers and instructions without simultaneously paying the price of increased memory use.

That best-of-both-worlds situation is exactly what the x32 ABI is trying to provide. A program compiled to this ABI will run in native 64-bit mode, but with 32-bit pointers and data values. The full register set will be available, as will other advantages of the 64-bit architecture like the faster SYSCALL instruction. If all goes according to plan, this ABI should be the fastest mode available on 64-bit machines for a wide range of programs; it is easy to see x32 widely displacing the 32-bit compatibility mode.
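In concrete terms, x32 is an "ILP32" model: int, long, and pointers are all 32 bits even though the code runs in 64-bit mode. A trivial program makes the difference visible (assuming a toolchain with the in-progress x32 support, where the mode is selected with a gcc -mx32 flag):

    /* Print the basic type sizes; the output depends on the ABI
     * selected at compile time. */
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)    = %zu\n", sizeof(int));
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(void *) = %zu\n", sizeof(void *));
        return 0;
    }

Built with -m64, this prints 4, 8, and 8; built with -mx32 (like -m32) it should print 4, 4, and 4, while still using the full 64-bit register set.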

One should note that the "if" above is still somewhat unproven: actual benchmarks showing the differences between x32 and the existing pure modes are hard to come by.

One outstanding question - and the spark for the current discussion - has to do with the system call ABI. For the most part, this ABI looks similar to what is used by the legacy 32-bit mode: the 32-bit-compatible versions of the system calls and associated data structures are used. But there is one difference: the x32 developers want to use the SYSCALL instruction just like native 64-bit applications do for the performance benefits. That complicates things a bit, since, to know what data size to expect, the kernel needs to be able to distinguish system calls made by true 64-bit applications from those made by applications running in the x32 mode, even though the processor is running in the same mode in both cases. As an added challenge, this distinction needs to be made without slowing down native 64-bit applications.

The solution involves using an expanded version of the 64-bit system call table. Many system calls can be called directly with no compatibility issues at all - a call to fork() needs little in the way of data-structure translation. Others do need the compatibility layer, though. Each of those system calls (92 of them) is assigned a new number starting at 512. That leaves a gap above the native system calls for additions over time. Bit 30 in the system call number is also set whenever an x32 binary calls into the kernel; that enables kernel code that cares to implement "compatibility mode" behavior.
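Seen from user space, that tagging is just a bitwise OR on the system call number before the SYSCALL instruction is issued. A minimal illustration (0x40000000 is bit 30, as described; a kernel with x32 support is assumed, and __NR_write is 1 on x86-64):

    /* Illustrative only: invoke write() the way an x32 binary would,
     * with bit 30 set in the system call number. */
    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    #define X32_SYSCALL_BIT 0x40000000UL    /* bit 30 */

    int main(void)
    {
        /* __NR_write is 1; tagged for x32 it becomes 0x40000001. */
        syscall(X32_SYSCALL_BIT | 1, 1, "hello\n", 6);
        return 0;
    }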

Linus didn't seem to mind the mechanism used to distinguish x32 system calls in general, but he hated the use of compatibility mode for the x32 ABI. He asked:

I think the real question is "why?". I think we're missing a lot of background for why we'd want yet another set of system calls at all, and why we'd want another state flag. Why can't the x32 code just use the native 64-bit system calls entirely?

There are legitimate reasons why some of the system calls cannot be shared between the x32 and 64-bit modes. Situations where user space passes structures containing pointers to the kernel (ioctl() and readv() being simple examples) will require special handling since those pointers will be 32-bit. Signal handling will always be special. Many of the other system calls done specially for x32, though, are there to minimize the differences between x32 and the legacy 32-bit mode. And those calls are the ones that Linus objects to most strongly.
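To see why, consider the structure readv() takes: it embeds a pointer and a size_t, so the very same C declaration produces two different memory layouts under the two ABIs:

    /* struct iovec, as declared in <sys/uio.h>; the field sizes depend
     * on the ABI, so the kernel cannot parse an x32 iovec array as if
     * it were a native 64-bit one. */
    #include <stddef.h>

    struct iovec {
        void   *iov_base;   /* 4 bytes on x32, 8 on x86-64 */
        size_t  iov_len;    /* 4 bytes on x32, 8 on x86-64 */
    };                      /* 8 bytes per element vs. 16 */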

It comes down, for the most part, to the format of integer values passed to the kernel in structures. The legacy 32-bit mode, naturally, uses 32-bit values in most cases; the x32 mode follows that lead. Linus is saying, though, that the 64-bit versions of the structures - with 64-bit integer values - should be used instead. At a minimum, doing things that way would minimize the differences between the x32 and native 64-bit modes. But there is also a correctness issue involved.

One place where the 32- and 64-bit modes differ is in their representation of time values; in the 32-bit world, types like time_t, struct timespec, and struct timeval are 32-bit quantities. And 32-bit time values will overflow in the year 2038. If the year-2000 issue showed anything, it's that long-term drop-dead days arrive sooner than one tends to think. So it's not surprising that Linus is unwilling to add a new ABI that would suffer from the 2038 issue:

2038 is a long time away for legacy binaries. It's *not* all that long away if you are introducing a new 32-bit mode for performance.

The width of time_t cannot change for legacy 32-bit binaries. But x32 is an entirely new ABI with no legacy users at all; it does not have to retain any sort of past compatibility at this point. Now is the only time that this kind of issue can be fixed. So it is probably entirely safe to say that an x32 ABI will not make it into the mainline as long as it has problems like the year-2038 bug.
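The deadline arithmetic is easy to check: a signed 32-bit time_t runs out 2^31 - 1 seconds (about 68 years) after the Unix epoch. A quick demonstration, run on a system whose own time_t is 64 bits:

    /* Print the last moment representable in a signed 32-bit time_t. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    int main(void)
    {
        time_t last = (time_t)INT32_MAX;    /* 2147483647 seconds */
        printf("%s", asctime(gmtime(&last)));
        /* prints: Tue Jan 19 03:14:07 2038 */
        return 0;
    }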

At this point, the x32 developers need to review their proposed system call ABI and find a way to rework it into something closer to Linus's taste; that process is already underway. Then developers can get into the serious business of building systems under that ABI and running benchmarks to see whether it is all worth the effort. Convincing distributors (other than Gentoo, of course) to support this ABI will take a fairly convincing story, but, if this mode lives up to its potential, that story might just be there.


Memory seen from a single process

Posted Sep 1, 2011 1:21 UTC (Thu) by cma (guest, #49905) [Link] (4 responses)

A doubt here...

Will x32 allow a single process to map/see more than 2GB of RAM?

Memory seen from a single process

Posted Sep 1, 2011 3:38 UTC (Thu) by foom (subscriber, #14868) [Link]

Yes...4GB!

Memory seen from a single process

Posted Sep 1, 2011 5:18 UTC (Thu) by Tuna-Fish (guest, #61751) [Link] (2 responses)

Since the kernel will always be running x64 and using address space well clear of the low 4GB, 32-bit user programs running on it always have the full 4GB available (well, less the first page usually).

If you need more than 4GB, you should compile your program for native 64 bit.

Memory seen from a single process

Posted Sep 13, 2011 17:43 UTC (Tue) by cma (guest, #49905) [Link] (1 responses)

Thanks! So this could be a problem for apps needing more than 4GB, like MySQL with larger buffers or a memory-based DB. Regards

Memory seen from a single process

Posted Sep 13, 2011 17:51 UTC (Tue) by dlang (guest, #313) [Link]

yes, if you have one application that needs more than 4G itself (not counting memory used by the kernel internally, or used by the kernel to buffer disk I/O), then you need to use AMD64, not x32

a large database is a perfect example of a situation where you would want the full 64 bits available.

given these other memory uses in a system, it's very likely that a machine with 6-8G of RAM that's dedicated for database use could still be very happy with x32

however, if you are splitting the database up using sharding (where you have multiple database instances, which could live on separate machines, including virtual machines), it's very possible that each one will only need 4G or less of address space even with far more ram.

also, if you have a database like postgres that uses multiple processes (instead of multiple threads), you should recognize that each process can have 4G of address space, so unless you have a huge amount of shared memory allocated, 4G per process may be a very comfortable limit.

The x32 system call ABI

Posted Sep 1, 2011 3:57 UTC (Thu) by njs (subscriber, #40338) [Link] (8 responses)

Curiously, that first link does have some benchmarks, and in none of them is x32 actually the best choice -- on one of them ia32 wins, and on one of them x86-64 wins. I guess this must reflect some lack of optimization in the toolchain or something, since I can't see how adding more registers could ever legitimately make a CPU-bound 32-bit program *slower*...?

The x32 system call ABI

Posted Sep 1, 2011 4:24 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Further down the page is this note:

    GCC
        The current x32 implementation isn't optimized:
            Atom LEA optimization is disabled.
            Memory addressing should be optimized.

So, that presumably accounts for why 181.mcf slowed down 0.5% to 1% relative to normal 32-bit x86.

The x32 system call ABI

Posted Sep 1, 2011 8:08 UTC (Thu) by slashdot (guest, #22014) [Link] (5 responses)

I don't understand why they need to change the kernel and add a new "x32" ABI.

Why not just have x32 programs use the x86-64 system calls and otherwise behave as normal x86-64 programs from the kernel's perspective?

The only difference would then be that they would only use 4GB of address space (mmap with MAP_32BIT), and store pointers in 32-bit-sized locations in memory.

In fact, you could probably use #pragma and/or __attribute__ to specify pointer size, and use a 64-bit libc, while most other libraries and the executable are 32-bit.
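The mmap() half of that suggestion can be sketched today; note, though, that the MAP_32BIT flag on Linux/x86-64 actually confines a mapping to the low 2GB, not the full 4GB:

    /* A 64-bit process asking for memory reachable through a 32-bit
     * pointer. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        printf("mapped at %p\n", p);    /* the address fits in 32 bits */
        munmap(p, 4096);
        return 0;
    }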

The x32 system call ABI

Posted Sep 1, 2011 11:56 UTC (Thu) by and (guest, #2883) [Link] (4 responses)

The problem is that some structures contain pointers, which are always 64 bits in kernel space, but x32 user space only has 32-bit pointers. The "new" system calls thus have to translate these pointers before the structures can be used by "normal" kernel-space code.

The x32 system call ABI

Posted Sep 1, 2011 12:53 UTC (Thu) by cesarb (subscriber, #6266) [Link] (3 responses)

Why not simply use the 64-bit structures then?

When putting a pointer into these structures, it can simply be zero-extended.

Only memory allocation system calls would need a new flag (to allocate below 4G). Other than these, the kernel does not have to change at all. The rest could be done in userspace.

(The only other change needed in the kernel would be to add a flag in the executable file format to make ASLR use only the lower 32 bits.)
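The zero-extension suggested above is a one-line cast in C; a minimal sketch, with a hypothetical 64-bit request structure standing in for the real syscall arguments:

    /* Under x32, uintptr_t is 32 bits; widening it to uint64_t
     * zero-extends the pointer, as the comment suggests. */
    #include <stdint.h>

    struct req64 {
        uint64_t buf;    /* user pointer carried as a 64-bit integer */
        uint64_t len;
    };

    static void fill_req(struct req64 *r, void *buf, uint32_t len)
    {
        r->buf = (uint64_t)(uintptr_t)buf;
        r->len = len;
    }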

The x32 system call ABI

Posted Sep 1, 2011 22:17 UTC (Thu) by hummassa (subscriber, #307) [Link] (2 responses)

If you use the 64 bit structures, then you have a 64-bit userland program...
Summarizing:
x86_64/amd64 => wastes space (and cache == performance), many registers
ia32 => gains space, few registers
x32 => gains space, many registers

The x32 system call ABI

Posted Sep 5, 2011 22:05 UTC (Mon) by butlerm (subscriber, #13312) [Link] (1 responses)

I believe the idea here is _not_ to use 64 bit pointers everywhere, but rather to use 64 bit pointers in certain circumstances, and do one of the following:

(1) Change the source level API for all pertinent ioctl structures that contain pointers so that programs have to manually zero extend a 32 bit pointer into some sort of opaque 64 bit value.
(2) Use a compiler extension that does this transparently, i.e. that supports a special pointer type where the high order bits are always zero.

I suspect (1) would break source compatibility in far too many places, although it seems like it is what should have been done way back when these interfaces were first designed.

(2) seems ideal, but requires cooperation from every supporting compiler. I don't know exactly why, but the x32 ABI devs are trying to avoid that if at all possible.

The x32 system call ABI

Posted Sep 6, 2011 1:05 UTC (Tue) by cesarb (subscriber, #6266) [Link]

There is also:

(3) Keep the source level API 32-bit, but have glibc do the zero-extension into the true 64-bit API before calling the kernel.

The main problem with that is, of course, ioctl (the same compat ioctl problem the kernel already has). So, how about this:

(4) Same as (3) but add a new x32_ioctl 64-bit syscall which calls into the compat ioctl engine the kernel already has.

The x32 system call ABI

Posted Sep 1, 2011 22:28 UTC (Thu) by daglwn (guest, #65432) [Link]

> can't see how adding more registers could ever legitimately make a
> CPU-bound 32-bit program *slower*...?

Several things could conspire to make this happen, besides the lack of optimization already noted.

- Function calls are more expensive due to additional callee-save registers.
- System calls are more expensive due to larger context save and restore.
- Things like setjmp/longjmp are slower for the same reason.
- Longer instruction encoding causes icache pressure.

Then there are all sorts of microarchitecture changes resulting from the ISA additions that can reduce clock-for-clock performance. Things like longer pipelines to compensate for more complicated instruction decoding, though these are likely secondary at best.

The x32 system call ABI

Posted Sep 1, 2011 10:28 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

This also assumes that native x86-64 won't have 512 syscalls at any point in the future. This suggests that the rate of syscall addition will slow or stop. This seems... unlikely, unless everyone falls in love with giant multiplex syscalls again.

The x32 system call ABI

Posted Sep 2, 2011 14:50 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (1 responses)

x32 syscall behaviour is supposed to be distinguished by a high bit set on %eax, not by the syscall table index. The choice to start numbering from 512 seems to be intended to avoid a collision with other additions made in parallel, not to provide a permanent distinction between native x86_64 and x32 syscalls.

The x32 system call ABI

Posted Sep 3, 2011 19:45 UTC (Sat) by nix (subscriber, #2304) [Link]

Ah, right, perfectly normal procedure then, just with a much bigger gap than I'm used to :)

The x32 system call ABI

Posted Sep 1, 2011 22:45 UTC (Thu) by gerdesj (subscriber, #5446) [Link] (15 responses)

>Convincing distributors (other than Gentoo, of course) to support this ABI

Surely it is application developers rather than distributors who would support this thing?

In the Gentoo case, presumably I'd merely have to remember to pick a kernel option before emerging something that could use this. Said option would be mentioned in the middle of a 300 package emerge, after setting a USE flag, which I'd never notice 8)

For now, fixing how CUPS can cause a 2 hour load time for a LibreOffice file is probably going to yield better performance improvements (it's something to do with printers being unavailable away from "home").

I am not an expert, but this looks like a bodge of some sort. An application is written to work with a 2^n bit system. If it runs on a 2^n bit system then great. If not then you'll need a compatibility layer.

Surely the 64 bit version of a (previously) 32 bit app can be efficient in terms of memory and register usage.

I can't help but be reminded of the 16 -> 32 bit migration.

FIX THE BLOODY APPLICATION!

Cheers
Jon

The x32 system call ABI

Posted Sep 1, 2011 23:00 UTC (Thu) by dlang (guest, #313) [Link] (14 responses)

depending on the application, the fact that pointers and memory addresses change from 32 bits to 64 bits can actually slow the system significantly.

the larger footprint uses more CPU cache, making the system spend more time waiting for the cache to be updated from memory.

this is why many of the chips that have both 32 bit and 64 bit modes tend to run 64 bit kernels with 32 bit userspace; for programs that don't need to address more than 4G of RAM, the overhead of the larger data objects results in a slowdown

x86/AMD64 is pretty much unique in the fact that 64 bit mode doesn't just extend the registers to 64 bits, it also gives you twice as many registers to work with. Since the x86 platform has far fewer registers than more modern designs, the availability of more registers means that far more things can happen in the CPU itself, without having to save and load values out to the cache (or worse yet, to RAM) in a constant shuffle to get register space for the things that need it. x86 systems spend a HUGE amount of time doing this register shuffle.

the idea behind the x32 architecture is to be able to take advantage of these extra registers (which almost always result in improved performance) without having to pay the overhead of larger pointers to memory.

the fact that many 32 bit applications that are not 64 bit clean can be made to run in this mode is pure gravy, and if the time_t change takes place, this may be sacrificed in order to get a better long-term x32 architecture.

That way lies madness

Posted Sep 2, 2011 5:35 UTC (Fri) by eru (subscriber, #2753) [Link] (13 responses)

the idea behind the x32 architecture is to be able to take advantage of these extra registers (which almost always result in improved performance) without having to pay the overhead of larger pointers to memory.

I can see the reasoning, but still I feel the idea is very bad. It reminds me too much of the "memory models" of MS-DOS, 16-bit Xenix, and 16-bit OS/2, and the problems associated with having separate library versions for each, and slightly different requirements and capacities of programs depending on how they were compiled. Been there, and did not like it. Please don't bring this mess to Linux!

Having more modes just means more available ways for the programmer to screw things up, and more possibilities for low-level bugs and security holes in the kernel and C library. The now-existing 32-bit mode in x86_64 is justifiable for supporting legacy binaries, but other memory models will just complicate things with very little gain.

That way lies madness

Posted Sep 2, 2011 14:41 UTC (Fri) by nix (subscriber, #2304) [Link] (12 responses)

Please don't bring this mess to Linux!
Linux has had 'this mess' since the days of SPARC64 in the 90s, and now with x86-64 and x86-32, biarchy is downright common. The linker and dynamic linker know about it, and you cannot accidentally link against the wrong library. Biarch packaging problems have largely been weeded out by the ubiquity of x86-64.

That way lies madness

Posted Sep 2, 2011 19:02 UTC (Fri) by dlang (guest, #313) [Link] (11 responses)

I know that SPARC and PowerPC both have this sort of 32/64 split.

and as far as I have seen, almost all distros for those chips ship 64 bit kernels with 32 bit userspace because 32 bit binaries are faster to run than 64 bit ones (due to the more compact code and memory addresses), as long as you can live in 4G of address space as an application.

there are actually very few cases where a single application needs to address more than 4G of address space, and in many, if not most of those cases there are real advantages to just running multiple processes rather than a single giant process. so this works very well in the real world.

That way lies madness

Posted Sep 2, 2011 22:36 UTC (Fri) by martinfick (subscriber, #4455) [Link] (7 responses)

> there are actually very few cases where a single application needs to address more than 4G of address space,

I guess you think java applications are few. :)

That way lies madness

Posted Sep 2, 2011 22:47 UTC (Fri) by dlang (guest, #313) [Link] (6 responses)

yes, the number of Java applications where a single application needs to address more than 4G of memory is small.

remember that virtualization is supposed to be the wave of the future, especially for things in datacenters. part of the way this works is that you slice up the memory available on a server to allocate it between many more small servers. most such servers end up with less than 4G per virtual server, and what we are talking about for x32 is 4G per _application_ (not counting OS buffering, kernel allocations, or any other overhead); this is a lot more elbow room.

not every application can fit in 4G, but when you really look at it, a surprising number of them will.

and pointer-heavy things like Java are especially likely to benefit from the smaller pointers of x32

That way lies madness

Posted Sep 5, 2011 7:48 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Java benefits so much that the Oracle JVM actually implements userspace pointer compression!

http://wikis.sun.com/display/HotSpotInternals/CompressedOops and
http://blog.juma.me.uk/tag/compressed-oops/

That way lies madness

Posted Sep 5, 2011 22:38 UTC (Mon) by intgr (subscriber, #39733) [Link] (4 responses)

> and pointer-heavy things like Java are especially likely to benifit from
> the smaller pointers of x32

Offtopic, but interesting: 64-bit Java already offers the -XX:+UseCompressedOops option which turns on pointer compression. By dropping 3 bits from the least significant end of the address, it can address 32GB of memory using 32-bit pointer fields.
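Roughly how that works: Java objects are 8-byte aligned, so the low three bits of any object address are always zero and need not be stored. A sketch of the arithmetic (names are illustrative, not HotSpot's):

    /* A 32-bit field scaled by the 8-byte object alignment covers
     * 2^32 * 8 bytes = 32GB of heap. */
    #include <stdint.h>

    static uint32_t compress(uint64_t heap_base, uint64_t addr)
    {
        return (uint32_t)((addr - heap_base) >> 3);
    }

    static uint64_t decompress(uint64_t heap_base, uint32_t oop)
    {
        return heap_base + ((uint64_t)oop << 3);
    }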

That way lies madness

Posted Sep 6, 2011 14:35 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Userspace pointer compression has its own costs. In my tests it often performs worse than the non-compressed version.

That way lies madness

Posted Apr 2, 2012 14:30 UTC (Mon) by Richard_J_Neill (subscriber, #23093) [Link] (2 responses)

This is quite a clever trick. If I understand rightly, what Java is doing is giving up byte-addressability, in favour of more address space. I.e. you can't create a pointer to a byte/char any more; the smallest data-type then becomes an int, and strings have to contain 4*n bytes. Given that x86 accesses memory 32-bits at a time anyway, this is a fairly natural thing to do.

That way lies madness

Posted Apr 3, 2012 20:49 UTC (Tue) by ibukanov (subscriber, #3942) [Link] (1 responses)

> Given that x86 accesses memory 32-bits at a time anyway,

On modern CPUs, memory is addressed internally by cache lines that are typically 16, 32, or 64 bytes in size. On x86, byte access is just as fast as 32-bit access. Moreover, misaligned access to 32-bit values is allowed and is not costly as long as the variable does not cross a cache line boundary.

That way lies madness

Posted May 21, 2012 15:08 UTC (Mon) by mikemol (guest, #83507) [Link]

For basic instructions, yes. Take a look at the SSE instructions; while there are unaligned and aligned versions for several, the aligned versions will carry better performance.

That way lies madness

Posted Sep 3, 2011 11:50 UTC (Sat) by raven667 (subscriber, #5198) [Link] (2 responses)

I kind of wish that the Linux distros took that approach when transitioning to x86_64, I am of the opinion that the transition would have been a lot smoother especially for desktops had that been the case.

That way lies madness

Posted Sep 3, 2011 17:12 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

there were two big factors that caused distros to go the direction they did for AMD64

1. especially early on there were problems with the compatibility mode causing occasional 'strange' errors when running 32 bit userspace on a 64 bit kernel.

2. the added registers of 64 bit mode significantly improve the performance of 64 bit code vs 32 bit code, in almost every case even when you take into account the extra overhead of the larger pointers.

That way lies madness

Posted Sep 6, 2011 3:30 UTC (Tue) by butlerm (subscriber, #13312) [Link]

Isn't this likely to be superior enough to motivate desktop distributions to switch to an x32 user space with a 64 bit kernel, with x86-64 libraries as extensions for those applications that actually benefit from a large address space?

MS and Apple? Intel or AMD?

Posted Sep 6, 2011 0:24 UTC (Tue) by kragilkragil2 (guest, #76172) [Link]

Do Windows or OSX do it this way?
My guess is they don't because they would need compiler support, but I also think that some engineers at Apple, MS, Intel or AMD did the benchmarks and concluded that it isn't worth it. AFAIK modern CPUs use some sort of shadow registers or something to mask away the performance penalties you get by having so few registers.

The x32 system call ABI

Posted Sep 6, 2011 5:49 UTC (Tue) by gmaxwell (guest, #30048) [Link] (13 responses)

ugh. I'm having such a hard time making myself believe that this is a good idea.

The inevitable result of this is that I'm going to have _two_ copies of most of my system libraries in core at all times, and we'll be back to the bad old days where common software isn't 64 bit clean (right now it's mostly only proprietary crap-ware like flash that's problematic).

And for what? So a very few overly-pointered, memory-bandwidth-bound test cases can run faster? And many of these cases could run just as well by switching to (e.g.) using pointer offsets internally (which would also reduce their scalability, but no worse than switching to 32 bit mode).

The x32 system call ABI

Posted Sep 6, 2011 12:27 UTC (Tue) by liljencrantz (guest, #28458) [Link]

Agreed. This, to me, sounds like over-optimizing.

Aside from the possibility of getting a 64-bit time_t on 32-bit systems, this sounds like a huge waste of time.

The x32 system call ABI

Posted Sep 7, 2011 6:23 UTC (Wed) by butlerm (subscriber, #13312) [Link] (3 responses)

If x32 compiled distributions run significantly faster than x64, it seems rather likely to me that desktop users will generally end up with _one_ x32 copy of system libraries in memory, with x64 libraries only loaded for the occasional application that needs a very large memory space.

With open source applications, what is there to complain about? If you don't like x32 just use x64 only.

And of course the big advantage of x32 over pointer compression is that no source modifications are required, modifications that in a typical C application would be extremely painful.

The x32 system call ABI

Posted Sep 7, 2011 6:47 UTC (Wed) by gmaxwell (guest, #30048) [Link] (2 responses)

"If you don't like x32 just use x64 only" which means I get to go back to the bad old days of playing (int) to (void *)/(size_t) conversion guy because when 64 bit systems weren't commonly deployed on developers desktops a lot of stuff simply didn't work without a bunch of fuss. The freedom of open source has tremendous but not infinite value— there is a real cost to being an oddball.

"If x32 compiled distributions run significantly faster than x64" IFF, but based on the currently available micro-benchmarks this seems unlikely. I've yet to see an example of a single application which is faster in x32 than best_of(x86,x86_64), and if we're in the two libraries mode then taking the choice of x86 for those few memory bandwidth bound pointer heavy apps that don't mine the scalability constraint is no worse.

"occasional application" like... my browser? (which is currently using ~4GiB of VM, though not as much resident obviously).

Not to mention the reduced address space for ASLR.

The x32 system call ABI

Posted Sep 8, 2011 21:19 UTC (Thu) by JanC_ (guest, #34940) [Link]

It's using almost 4 GiB on a 64-bit system now? But of course your browser would supposedly need significantly less memory when running in x32 mode? And once Firefox also uses out-of-process rendering (like Chrome/Chromium), that would become even less of an issue...?

The x32 system call ABI

Posted Sep 9, 2011 2:30 UTC (Fri) by butlerm (subscriber, #13312) [Link]

>I've yet to see an example of a single application which is faster in x32 than best_of(x86,x86_64)

That is the wrong metric to judge an ABI by - unless you agree that we should stick with an x86 + x86_64 biarchy indefinitely, and have distributions compile every other application appropriately. Then we really will end up with both sets of libraries pinned in memory.

x32 is noticeably better than x86, by as much as 30% on some benchmarks. It is also noticeably better than x86_64, by another 30% on important workloads. It is a better all-around ABI for most applications.

x86 is stunted, and will hopefully go away in a few years. But x32 sounds like it is worth keeping around for a long time. A 30% performance increase on many workloads isn't the sort of thing you want to idly throw away.

The x32 system call ABI

Posted Sep 9, 2011 12:23 UTC (Fri) by NikLi (guest, #66938) [Link] (7 responses)

It is not "pointer memory bandwidth bound test cases".

A vm like python uses a *lot* of pointers:

- a list of 'n' items is a buffer of 'n' pointers. Same for tuples.
- a dictionary of 'n' items is a buffer of ~6*n pointers
- every string item carries a pointer
- every instance is a dictionary plus a couple of pointers

C programmers think in terms of memory buffers, but dynamic languages, where objects work by reference, are mostly built on tons of pointers; this is what makes them dynamic. And yes, making all those pointers half their size is very important. Imagine that when you want to look up something in a list, the list is fetched into the cache and all the pointers are traversed while looking for the item; fetching a 2k buffer is better than fetching a 4k buffer. In fact, x86 might be more suitable than x86-64 for such vms!
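The first item in the list above is easy to picture in C terms; a simplified sketch of CPython's list layout (the real PyListObject has more header fields):

    /* A CPython list is an array of object pointers, so halving the
     * pointer size halves the cache footprint of every list. */
    typedef struct _object PyObject;

    typedef struct {
        PyObject **ob_item;     /* n pointers: 4n bytes on x32, 8n on x86-64 */
        long       allocated;   /* slots allocated */
    } list_sketch;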

(It would be very interesting to see some python benchmarks for x32 vs x86, nonetheless.)

Now, one may say that "if you want speed, do it in C". However, making a dynamic language faster will benefit thousands of programs written in that language, which is important for some people.

Using pointer offsets suffers from one extra indirection and will kill a big part of the cache. On the other hand, pointing to more than 4G of things is overkill.

The x32 system call ABI

Posted Sep 9, 2011 14:12 UTC (Fri) by gmaxwell (guest, #30048) [Link] (6 responses)

> Using pointer offsets suffers from one extra indirection and will kill a big part of the cache. On the other hand, pointing to more than 4G of things is an overkill.

You use a single offset (after all, we're assuming you're willing to take a 4G limit in these applications) and keep it in a register.

Alternatively, how about an ABI that promises you can get memory under the 4G mark, where you use 32 bits internally and convert at the boundaries to external libraries? This way single applications can be 32 bit without overhead, but it doesn't drag the whole system with it.
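Concretely, the offset scheme described above looks something like this (a sketch assuming all allocations come from one arena of at most 4GB):

    /* 32-bit offsets from a single base pointer replace full 64-bit
     * pointers; the base stays hot in a register. */
    #include <stdint.h>

    static char *arena_base;     /* set once, when the arena is mapped */

    typedef uint32_t ref_t;      /* a 32-bit "pointer": an offset into the arena */

    static inline void *deref(ref_t r)    { return arena_base + r; }
    static inline ref_t make_ref(void *p) { return (ref_t)((char *)p - arena_base); }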

The x32 system call ABI

Posted Sep 9, 2011 15:45 UTC (Fri) by dlang (guest, #313) [Link] (5 responses)

what you are trying to describe is basically what the x32 architecture is doing.

however you missed that libraries can allocate memory as well, and so the libraries must be compiled to only request memory under 4G as well.

The x32 system call ABI

Posted Sep 9, 2011 18:12 UTC (Fri) by gmaxwell (guest, #30048) [Link] (1 responses)

It would only take a single syscall to the kernel to tell it to never give this process access to _any_ address space outside of the first 4GB (whether via sbrk, mmap, etc.).

It would have ~all the performance benefits without doubling the libraries in memory. It wouldn't, however, retain the reduced porting burden for existing 32bit crapware, since pointers in library-owned structures would be the wrong size. ::shrugs::

The x32 system call ABI

Posted Sep 11, 2011 3:23 UTC (Sun) by butlerm (subscriber, #13312) [Link]

What you describe could be done, but it would be difficult to implement, require special compiler support to do well, and would break source compatibility even with special compiler support.

It would be essentially the same as adding support for 80286 style near and far pointers across the code base. In C, every structure, every header file, every shared pointer declaration would potentially have to be marked whether it was using large or small pointers. The compiler certainly wouldn't know that an arbitrary function or structure declaration was referring to something from a library, and some libraries would have to come in a non-standard flavor in any case.

Now as you say, there are certain advantages to that, in terms of memory and cache footprint. They did it back in the 80286 era for a reason. But it is much more impractical to implement that sort of thing across the source code for practically everything than simply to compile under a new ABI, especially if the new ABI performs well enough to be the system default.

A reasonable distribution policy could be to replace x86 with x32, and not ship x86_64 libraries in x32 distributions. It could simply say that if you want to have a 64 bit user space, you should use a full 64 bit version. 64 bit addressing could be reserved for the kernel. If I were to guess, half of the people currently planning to use x32 (e.g. in embedded applications) have that sort of thing in mind in any case.

The x32 system call ABI

Posted Apr 9, 2012 21:28 UTC (Mon) by snadrus (guest, #60224) [Link] (2 responses)

What about building x32 off ia32 compatibility? There would be no kernel changes, but just compiler changes to use the additional registers. You may even be able to use ia32 or x32 libraries interchangeably if you're not passing by register.

The x32 system call ABI

Posted Apr 10, 2012 9:08 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

1. Open wikipedia. Read.
2. Try to pretend you never asked this question.

Perhaps then you'll be considered seriously in some future architecture dispute.

Your worst yet, and for me your last

Posted Apr 13, 2012 6:37 UTC (Fri) by biged (guest, #50106) [Link]

Khim, you have exceeded your usual levels of hostility and brashness with this comment, and so I have added you to my filter. (I mention this as a reminder to others: My Account -> Comment Filtering.)

Your response here is beyond rude: it is poisonous. You should realise that with more time and attention someone might be able to explain the misconception, help others and avoid insulting anyone.

Please stop treating LWN as your inbox: post less often, and more thoughtfully. For me, you have become a spammer.

The x32 system call ABI

Posted Sep 10, 2011 8:27 UTC (Sat) by bersl2 (guest, #34928) [Link] (2 responses)

I know nobody cares, but seeing "x32" makes my blood boil as much as seeing "x64" did.

The x32 system call ABI

Posted Sep 10, 2011 17:11 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

I can see reasons for preferring AMD64 to x64, but why are you offended by x32, and what would you prefer instead?

The x32 system call ABI

Posted Sep 11, 2011 14:28 UTC (Sun) by Baylink (guest, #755) [Link]

I believe bersl thinks the same thing I do: it should be x86_32, for parallelism with x86_64.

The x32 system call ABI

Posted Apr 15, 2012 21:16 UTC (Sun) by tenchiki (subscriber, #53749) [Link] (4 responses)

I hope the people developing the x32 ABI have looked at what worked and what didn't work in the IRIX n32 ABI; it had a lot of the same requirements and issues. IRIX 64-bit systems ran 99%+ of userspace as n32 for the same reasons mentioned (most apps don't need >2G address space, and 32bit pointers use the cache better).
One of the notable specs for the n32 ABI that doesn't seem to have been mentioned for x32 is that it made the long datatype 64 bits:

    ABI:      o32   n32   64
    int        32    32   32
    long       32    64   64
    pointer    32    32   64
    (all other types same size)

The x32 system call ABI

Posted Apr 15, 2012 23:05 UTC (Sun) by khim (subscriber, #9252) [Link] (3 responses)

One of the notable specs for the n32 ABI that don't seem to have been mentioned for x32 is to make the long datatype to be 64bits

Why would anyone want this? x32 uses the ILP32 model to minimize differences between the IA32 mode and the x32 mode. Any other choice just looks… strange.

The x32 system call ABI

Posted Apr 16, 2012 6:56 UTC (Mon) by paulj (subscriber, #341) [Link] (2 responses)

Because it lets you have access to the 64bit arithmetic capabilities of the hardware, even when you don't need the 64bit addressing capabilities.

The x32 system call ABI

Posted Apr 16, 2012 9:00 UTC (Mon) by khim (subscriber, #9252) [Link] (1 responses)

What's wrong with using long long for that?

The x32 system call ABI

Posted Apr 16, 2012 11:07 UTC (Mon) by paulj (subscriber, #341) [Link]

Hmm, good point. Actually, I think in IRIX n32 it was indeed "long long" that was 64bit! (Making long 64 bits risked breaking software that assumed sizeof(long) == sizeof(void *) and tried to store pointers in longs.)

The x32 system call ABI

Posted Dec 2, 2012 14:08 UTC (Sun) by normcf (guest, #88125) [Link] (1 responses)

Having found the conversation interesting, I just want to interject a small historical thought. About 30 years ago, I worked on a Burroughs B6700. This machine had a 48bit architecture and virtual memory management that was quite advanced. In this discussion, I have heard many consider the tradeoffs of 64bit pointers in the cache vs most programs requiring < 4G of user space. I offer a compromise of 48 bits, which would greatly push out the date/time issue, give space for almost all programs, and still not be piggish on pointers in cache. Of course, I presume there are plenty of downsides to this, including, perhaps, hardware issues, but if we're doing all this work anyway, changing system calls, compilers, loaders etc., maybe considering a middle ground would be useful. Please just ignore me if it is too ridiculous. Thanks.

The x32 system call ABI

Posted Dec 2, 2012 23:06 UTC (Sun) by dlang (guest, #313) [Link]

all current hardware has significant benefits if the data is aligned properly (32 bit values aligned on a multiple of 4 bytes, 64 bit values aligned on a multiple of 8 bytes). As a result, in the normal case, 48 bit unaligned values are going to be slower to use.

If you have a huge array of pointers, the memory savings will outweigh this cost, but not for the normal uses.
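To make the alignment point concrete: with no natural six-byte type, a packed 48-bit pointer has to be reassembled a byte at a time (or read with a masked 64-bit load), and consecutive array elements straddle natural boundaries. A hypothetical sketch:

    /* A 48-bit pointer packed into 6 bytes; array elements land on
     * 6-byte boundaries, so most straddle natural alignment. */
    #include <stdint.h>

    struct p48 { uint8_t b[6]; };

    static uint64_t load_p48(const struct p48 *p)
    {
        uint64_t v = 0;
        for (int i = 0; i < 6; i++)
            v |= (uint64_t)p->b[i] << (8 * i);   /* little-endian reassembly */
        return v;
    }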


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds