Other flavours

Posted Dec 13, 2018 16:38 UTC (Thu) by epa (subscriber, #39769)
In reply to: Other flavours by excors
Parent article: The x32 subarchitecture may be removed

I suppose the only performance benefit you might get is this. Instead of loading one 32-bit value from memory into a register, and then another 32-bit value from the following four bytes of memory into a second register, do a single 64-bit load and then split the two halves into separate registers. The splitting wouldn't need to go to memory at all, provided the instructions to do it stay in the instruction cache (and perhaps you have a scratch register to use). And if neither of the two four-byte memory locations was in the data cache, a single 64-bit load might be faster than two 32-bit loads.
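
A rough sketch of the idea in C, purely for illustration. The helper name load_pair is made up, and the split assumes a little-endian machine such as x86, where the value at the lower address ends up in the low half of the 64-bit register. Whether the compiler really emits one 64-bit load here depends on the compiler and optimisation level.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch two adjacent 32-bit values with one 64-bit load. */
static void load_pair(const uint32_t *p, uint32_t *first, uint32_t *second)
{
    uint64_t both;

    assert(p != NULL);

    /* memcpy sidesteps alignment and aliasing problems; optimising
     * compilers typically turn it into a single 64-bit load. */
    memcpy(&both, p, sizeof both);

    /* Assumption: little-endian layout, so p[0] is the low half. */
    *first  = (uint32_t)both;          /* low 32 bits  */
    *second = (uint32_t)(both >> 32);  /* high 32 bits */
}
```

The split is a register copy plus a shift, so nothing beyond the initial load has to touch memory.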

I agree, it doesn't really seem worth the effort in general, but perhaps in some tight inner loop that works with 32-bit arithmetic it might make a difference.

Other flavours

Posted Dec 13, 2018 19:36 UTC (Thu) by excors (subscriber, #95769)

If you're doing two loads from consecutive addresses, the first one will pull the whole 64-byte cache line into L1 cache, so the second load will be extremely quick. And I expect an out-of-order CPU would execute both loads in parallel anyway, and the L1 cache would merge them into a single cache line request to L2, so there would be almost no difference compared to a single 64-bit load.

If you have a tight inner loop where such tiny differences matter, you should be using SSE/AVX anyway so that you're loading 128/256/512 bits at once and doing all your arithmetic with SIMD instructions.
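
For illustration, a minimal sketch of that approach using SSE2 intrinsics, which load and add four 32-bit values per instruction. The function name add_u32_sse2 and the requirement that n be a multiple of 4 are assumptions made to keep the example short; real code would also handle the leftover elements.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: dst[i] = a[i] + b[i] for n 32-bit elements.
 * Assumption: n is a multiple of 4, so there is no scalar tail. */
static void add_u32_sse2(uint32_t *dst, const uint32_t *a,
                         const uint32_t *b, size_t n)
{
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* One unaligned 128-bit load brings in four 32-bit values. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));

        /* Four 32-bit additions in a single instruction. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
}
```

The same pattern widens to 256 or 512 bits with the corresponding AVX2 or AVX-512 intrinsics, on CPUs that support them.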