I came here to say pretty much the same thing. Instructions in general are waaaaay faster than memory, so caring about branch predictor performance on an "easy" case (in this case, a long-running memory fill loop) is just silly. Modern CPUs can issue multiple instructions per cycle and still end up measuring run time in cycles per instruction, because memory is slooooooow.
I believe AMD recommended "rep stosd" for filling memory at one time. If you want to go faster still, I imagine there are SSE equivalents that store 128 or 256 bits at a go. (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)
If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe that on AMD devices, at least, a write prefetch moves the line directly into the "O"wned state of the MOESI protocol, rather than waiting for the "S"hared -> "O"wned transition on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)