Why would you copy zeros instead of filling with them? A fill does no reads, so there are never any read misses to wait on, and the cache can merge the writes. All of the operations in a fill are independent, too. In a copy, every write has to wait for its read to complete.
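To make the difference concrete, here is a minimal sketch in C. The function names `zero_by_fill` and `zero_by_copy` are made up for illustration, and it assumes the standard library's `memset` and `memcpy` are reasonably tuned for your platform. How big the gap is in practice depends on the hardware and the buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill: store-only traffic. No source buffer is read, so the
 * stores do not depend on any load completing first. */
void zero_by_fill(void *dst, size_t n) {
    if (n == 0) return;
    assert(dst != NULL);
    memset(dst, 0, n);
}

/* Copy: each stored value comes from a load out of zeros_src,
 * so the source has to be pulled through the cache as well. */
void zero_by_copy(void *dst, const void *zeros_src, size_t n) {
    if (n == 0) return;
    assert(dst != NULL && zeros_src != NULL);
    memcpy(dst, zeros_src, n);
}
```

Both leave `dst` holding zeros. The copy just touches twice as much memory to get there.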
If you want to get really fancy on a modern ISA without touching DMA engines, you'd use the various prefetch-for-write and streaming-write instructions that write 128 bits or more at a time. (I'm not limiting myself to x86 and the SSE variants here.)
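As a rough sketch of the prefetch-for-write half of that, GCC and Clang expose `__builtin_prefetch`, whose second argument selects "prepare for write" and which maps to whatever the target ISA offers (or to nothing). The function name, the 64-byte line size, and the four-line prefetch distance below are assumptions for illustration, not tuned values. Streaming stores are ISA-specific intrinsics and are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check your target. */
#define ASSUMED_LINE_BYTES 64
/* Assumption: prefetch four lines ahead. Purely illustrative. */
#define ASSUMED_LINES_AHEAD 4

/* Zero `count` 64-bit words, hinting upcoming lines for write. */
void zero_with_write_prefetch(uint64_t *dst, size_t count) {
    if (count == 0) return;
    assert(dst != NULL);

    const size_t per_line = ASSUMED_LINE_BYTES / sizeof(uint64_t);
    const size_t ahead = ASSUMED_LINES_AHEAD * per_line;

    for (size_t i = 0; i < count; i++) {
        /* Once per line, hint a later line that is still in bounds.
         * rw = 1 means "will be written"; locality = 0 means the
         * data need not stay in cache afterwards. */
        if (i % per_line == 0 && i + ahead < count) {
            __builtin_prefetch(&dst[i + ahead], 1, 0);
        }
        dst[i] = 0;
    }
}
```

The prefetch is only a hint, and an optimizing compiler may turn the whole loop into a `memset` call anyway, so measure before assuming this buys anything over the plain fill.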