Other flavours

Posted Dec 12, 2018 20:18 UTC (Wed) by ibukanov (subscriber, #3942)
In reply to: Other flavours by plugwash
Parent article: The x32 subarchitecture may be removed

But then what makes the x32 ABI different from the x86 ABI that the kernel supports in any case?



Other flavours

Posted Dec 12, 2018 21:32 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (6 responses)

Data types are still 64-bit in x32 where they might be 32-bit in x86. Time, file offsets, stat fields, etc.
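
A minimal way to see this (my sketch, assuming a gcc and glibc built with both the i386 and x32 multilibs) is to build the same program with -m32 and with -mx32 and compare:

    /* sizes.c - build twice: gcc -m32 sizes.c  and  gcc -mx32 sizes.c */
    #include <stdio.h>
    #include <time.h>
    #include <sys/types.h>

    int main(void)
    {
        printf("void*  : %zu\n", sizeof(void *)); /* 4 bytes on both i386 and x32 */
        printf("long   : %zu\n", sizeof(long));   /* 4 bytes on both */
        printf("time_t : %zu\n", sizeof(time_t)); /* 4 on classic i386, 8 on x32 */
        printf("off_t  : %zu\n", sizeof(off_t));  /* 4 on i386 without LFS, 8 on x32 */
        return 0;
    }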

Other flavours

Posted Dec 13, 2018 7:01 UTC (Thu) by ibukanov (subscriber, #3942) [Link] (5 responses)

In retrospect this was a wrong decision. If x32 used the same ABI as x86, the maintenance burden would be lower.

Other flavours

Posted Dec 13, 2018 8:18 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link] (4 responses)

At that time, Linus said (https://lwn.net/Articles/456750/):
> And I really do think that a new 32-bit ABI is *much* better off trying to be compatible
> with x86-64 (and avoiding things like 2038) than it is trying to be compatible with the
> old-style x86-32 binaries. I realize that it may be *easier* to be compatible with x86-32
> and just add a few new system calls, but I think it's wrong.
>
> 2038 is a long time away for legacy binaries. It's *not* all that long away if you are
> introducing a new 32-bit mode for performance.

As far as I can see, this assessment has not changed since then.

Other flavours

Posted Dec 13, 2018 13:31 UTC (Thu) by plugwash (subscriber, #29694) [Link] (3 responses)

What has changed since then is that a mechanism has been introduced to allow 32-bit architectures to use 64-bit time. There has been a mechanism for 32-bit architectures to use 64-bit file offsets for a long time.

Other than time and file offsets I can't think of much that has a pressing need to be 64-bit on 32-bit systems.
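
For file offsets that mechanism is the long-standing large-file-support macro; for time it eventually surfaced in glibc as _TIME_BITS (which arrived after this thread and has to be combined with _FILE_OFFSET_BITS=64). A rough sketch of how a 32-bit build opts in:

    /* Build for a 32-bit target:
     *   gcc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 large.c
     * _FILE_OFFSET_BITS=64 is the long-standing LFS switch; _TIME_BITS=64
     * needs a recent glibc and requires the LFS switch as well. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc > 1 && stat(argv[1], &st) == 0) {
            /* st_size and st_mtime are 64-bit here even on a 32-bit build */
            printf("%lld bytes, mtime %lld\n",
                   (long long)st.st_size, (long long)st.st_mtime);
        }
        return 0;
    }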

Other flavours

Posted Dec 13, 2018 19:02 UTC (Thu) by jccleaver (guest, #127418) [Link] (2 responses)

> What has changed since then is that a mechanism has been introduced to allow 32-bit architectures to use 64-bit time. There has been a mechanism for 32-bit architectures to use 64-bit file offsets for a long time.

This is absolutely key. If this decision were to be rolled back in the interests of a meaningfully usable x32, it would be a great step.

Other flavours

Posted Dec 13, 2018 23:22 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

Why? 64-bit time for 32-bit architectures is worth the effort because 32-bit-only systems are still being sold today (the ARM Cortex-R range, including new designs, for example, not to mention embedded systems using older ARMv7-A cores, plus anything designed around the DM&P Vortex86 SoCs or RDC's Emkore and IAD chips, which are x86 CPUs with no 64-bit support). Thus, we need to address this anyway; these chips are going to be around for a while, and saying that brand new hardware designed in 2019 (or probably 2020) is going to be worthless before 2038 isn't exactly nice.

OTOH, x32 is just a potential speedup for users who could use amd64 or i386 ABIs; it doesn't expand the user base by any significant amount, and does involve engineering effort.

Other flavours

Posted Dec 14, 2018 8:30 UTC (Fri) by joib (subscriber, #8541) [Link]

I think the argument was that it would be nicer if x32 used the generic support for 64-bit time_t on 32-bit targets rather than having its own custom way of doing it.

Other flavours

Posted Dec 12, 2018 21:37 UTC (Wed) by ken (subscriber, #625) [Link] (6 responses)

64-bit instructions and more registers. x32 is x86-64 minus the 64-bit address space, so the only speed gain is the small amount of extra speed you get from not using so much cache space to store pointers.
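
To make the cache argument concrete, a sketch (mine, not the parent's): a pointer-heavy node shrinks by half under a 32-bit-pointer model, so twice as many fit in a cache line.

    #include <stdio.h>

    /* Two pointers plus an int: 24 bytes with 8-byte pointers (amd64),
     * 12 bytes with 4-byte pointers (x32 or i386). */
    struct node {
        struct node *next;
        struct node *prev;
        int          value;
    };

    int main(void)
    {
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }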

Other flavours

Posted Dec 13, 2018 8:36 UTC (Thu) by epa (subscriber, #39769) [Link] (5 responses)

If you are working with 32-bit integers and 32-bit pointers, but running in the CPU's 64-bit mode, could you in principle get even more registers by treating each 64-bit register as two 32-bit values? The compiler would need to handle this for you.

Other flavours

Posted Dec 13, 2018 11:01 UTC (Thu) by NAR (subscriber, #1313) [Link]

In that case registers would overflow into each other, which sounds like a big mess.

Other flavours

Posted Dec 13, 2018 11:47 UTC (Thu) by excors (subscriber, #95769) [Link] (2 responses)

Most x86-64 instructions with 32-bit outputs will set the high 32 bits of the output register to zero, so you'd need to do a lot of extra work (masking and shifting and ORing) to preserve those high bits, which sounds like it would eliminate the performance benefits of storing more data in registers.

ARM NEON does let you split up registers like that - the 32-bit registers S0 and S1 are the two halves of the 64-bit register D0, and D0/D1 are the two halves of 128-bit Q0, and so on up to D30/D31 = Q15. But that makes it much harder for an out-of-order CPU to accurately determine dependencies between instructions and do correct register renaming, so AArch64 got rid of that aliasing - now S0/D0/Q0 share one physical register, S1/D1/Q1 share another, etc. Better to sacrifice some utilisation of register space in exchange for a simpler and more efficient microarchitecture.
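
As a rough C-level sketch (mine) of that extra work: keeping two independent 32-bit values in one 64-bit "register" means every write to one half has to mask and OR to preserve the other, whereas an ordinary 32-bit x86-64 instruction simply zeroes the upper half of its destination.

    #include <stdint.h>

    /* Two logical 32-bit values packed into one 64-bit word. */
    static inline uint64_t set_low(uint64_t pair, uint32_t v)
    {
        return (pair & 0xffffffff00000000ull) | v;            /* mask + or */
    }

    static inline uint64_t set_high(uint64_t pair, uint32_t v)
    {
        return (pair & 0xffffffffull) | ((uint64_t)v << 32);  /* mask + shift + or */
    }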

Other flavours

Posted Dec 13, 2018 16:38 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

I suppose the only performance benefit you might get is this. Instead of loading one 32-bit value from memory into a register, and then another 32-bit value from the following four bytes of memory into a second register, do a single 64-bit load and then shuffle the registers around. The shuffling wouldn't need to go to memory at all, provided the instructions to do it stay in the instruction cache (and perhaps you have a scratch register to use). And if neither of the two four-byte memory locations was in the data cache, a single 64-bit load might be faster than two 32-bit loads.

I agree, it doesn't really seem worth the effort in general, but perhaps in some tight inner loop that works with 32-bit arithmetic it might make a difference.
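
In C the idea looks roughly like this (a sketch, assuming little-endian layout and suitably aligned data):

    #include <stdint.h>
    #include <string.h>

    /* One 64-bit load instead of two adjacent 32-bit loads; the halves
     * are then split with register-only shifts (little-endian assumed). */
    void load_pair(const uint32_t *p, uint32_t *a, uint32_t *b)
    {
        uint64_t both;
        memcpy(&both, p, sizeof both);   /* single 8-byte load */
        *a = (uint32_t)both;             /* p[0] */
        *b = (uint32_t)(both >> 32);     /* p[1] */
    }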

Other flavours

Posted Dec 13, 2018 19:36 UTC (Thu) by excors (subscriber, #95769) [Link]

If you're doing two loads from consecutive addresses, the first one will pull the whole 64B cache line into L1 cache, so the second load will be extremely quick. And I expect an OoO CPU would execute both loads in parallel anyway, and the L1 cache would merge them into a single cache line request to L2, so there would be almost no difference to a single 64-bit load.

If you have a tight inner loop where such tiny differences matter, you should be using SSE/AVX anyway so that you're loading 128/256/512 bits at once and doing all your arithmetic with SIMD instructions.
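
For comparison, the SIMD route suggested above would look roughly like this with SSE2 intrinsics (a sketch; it assumes the length is a multiple of four and leaves alignment and tail handling to the reader):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* Add two int32 arrays four elements at a time with 128-bit loads and stores. */
    void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
        }
    }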

Other flavours

Posted Dec 13, 2018 12:01 UTC (Thu) by joib (subscriber, #8541) [Link]

If the world has learnt something from the partial register stall mess of ye olde x86, then it's "let's not do that any more".

For x86 with mem-op instructions and register renaming, 16 GPRs are enough for most purposes.

