> The assumption here is that the CPU's internal microcode for
> running a loop is a lot faster than stepping through two instructions:
I admit I haven't kept up with the ISAs since the Pentium era, but for a while the rep-prefixed string instructions were in fact slower than open-coded loops. Anyway, if it were true that rep movs is faster than a dec/jnz loop, there is rep stosd, which writes the destination the same way but without copying from a source.
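To make the comparison concrete, here is a rough sketch of the two approaches as 32-bit fills. The function names fill_loop and fill_rep_stos are made up for illustration, and the inline asm assumes GCC or Clang on x86 (it falls back to the plain loop elsewhere). It only shows that both do the same job; which one is faster depends on the CPU generation and the count, so measure before believing either claim.

```c
#include <stddef.h>
#include <stdint.h>

/* Open-coded fill: one store plus a decrement-and-branch per element.
 * (Hypothetical helper name, for illustration only.) */
static void fill_loop(uint32_t *dst, uint32_t value, size_t count)
{
    while (count != 0) {
        *dst++ = value;
        count--;
    }
}

/* Same fill using rep stosd (spelled "rep stosl" in AT&T syntax):
 * stores EAX to [EDI/RDI] and advances, ECX/RCX times. No source is read.
 * Assumes the direction flag is clear, which the usual x86 ABIs guarantee
 * on function entry. */
static void fill_rep_stos(uint32_t *dst, uint32_t value, size_t count)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile ("rep stosl"
                      : "+D"(dst), "+c"(count)   /* both are modified */
                      : "a"(value)
                      : "memory");
#else
    fill_loop(dst, value, count);                /* non-x86 fallback */
#endif
}
```

Timing both over a range of buffer sizes on the machine you actually care about would settle the question better than guessing about microcode.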