You'd copy because the whole "copy 4096 bytes" operation is ONE instruction, "rep movsd" (or "rep movsb", which is probably optimized internally to move whole words, except for any non-word-aligned bytes at the start and end). The entire loop logic lives inside the CPU, so there's no stepping through macroinstructions like "cmp," "jnz," "dec," or "loop."
The assumption here is that the CPU's internal microcode for running a loop is a lot faster than stepping through two instructions:
; Option A: one instruction
rep movsd ; copy ecx dwords from [esi] to [edi]
; Option B: a two-instruction loop
@j00: ; label, not an instruction
mov byte [addr+ecx-1], 0 ; write 0x00 to addr+ecx-1
loop @j00 ; dec ecx, then jnz @j00
The loop takes a branch on every iteration, and thus leans on branch prediction. The rep movsd copy has to read its source through the cache, and thus leans on prefetching, but its loop runs internally in microcode. Which is faster?