Is there some architecture that has a way to run a loop without a compare and branch, and where the cost of that compare and branch doesn't disappear when it is followed by a load with offset? On the in-order architectures I've seen, the test is no more expensive than the overhead of one loop iteration; on out-of-order architectures, the processor should know which path it's actually running shortly before it would know which address it's loading anyway. But I confess that the last time I looked at out-of-order architecture design was when the Pentium 4 was being designed, so I may be expecting the processor to analyze way too much.
Posted Nov 26, 2010 20:25 UTC (Fri) by dlang (✭ supporter ✭, #313)
I also question the expense of the test. All the values needed are already in registers, and I would expect that, because the CPU is so much faster than the RAM, the test would happen without noticeably affecting the overall memory copy time (the copy has to wait at least for the cache, if not for main memory).
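To make the point concrete, here is a rough sketch of what I have in mind. It assumes the test in question is an overlap/direction check of the kind a memmove-style routine does before copying. The function name is made up for illustration and this is not how any real C library implements it; the point is only that the check is one subtract and one compare on values that are already in registers, done once before a loop that is bound by memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, for illustration only; not taken from any real library. */
void *copy_with_direction_test(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /*
     * The test: a single unsigned subtract and compare.
     * If dst lies inside [src, src + n), a forward copy would overwrite
     * source bytes before reading them, so copy backward in that case.
     * If dst is below src the subtraction wraps to a large value, which
     * also selects the forward path.
     */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        /* Forward copy: regions do not overlap, or dst is below src. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Backward copy: dst overlaps the tail of src. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

A real implementation would copy in wider units and handle alignment, but the up-front test would look much the same, and it runs once per call rather than once per byte.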