Other flavours

Posted Dec 13, 2018 19:36 UTC (Thu) by excors (subscriber, #95769)
In reply to: Other flavours by epa
Parent article: The x32 subarchitecture may be removed

If you're doing two loads from consecutive addresses, the first one will pull the whole 64B cache line into L1 cache, so the second load will be extremely quick. And I expect an OoO CPU would execute both loads in parallel anyway, and the L1 cache would merge them into a single cache line request to L2, so there would be almost no difference to a single 64-bit load.

If you have a tight inner loop where such tiny differences matter, you should be using SSE/AVX anyway so that you're loading 128/256/512 bits at once and doing all your arithmetic with SIMD instructions.