I believe AMD at one time recommended "rep stosd" for filling memory. If you want to go faster still, I imagine there are SSE equivalents that store 128 bits at a go (or 256 bits with AVX). (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)
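To make the wide-store idea concrete, here is a minimal sketch of a fill loop that writes 128 bits per store using SSE2 intrinsics. The helper name fill32 is hypothetical, and the code falls back to a plain scalar loop when SSE2 is not available. It is not tuned or benchmarked, so treat it as an illustration rather than a recommended memset replacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: store `value` into n consecutive 32-bit words. */
static void fill32(uint32_t *dst, uint32_t value, size_t n)
{
    size_t i = 0;

    assert(dst != NULL || n == 0);
#if defined(__SSE2__)
    /* Broadcast the 32-bit value into all four lanes of a 128-bit register. */
    const __m128i v = _mm_set1_epi32((int)value);

    /* Unaligned 128-bit stores, four words at a time. */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_si128((__m128i *)(dst + i), v);
#endif
    /* Scalar loop: handles the tail, or everything when SSE2 is absent. */
    for (; i < n; i++)
        dst[i] = value;
}
```

The unaligned store keeps the sketch simple. A real implementation would likely align the destination first and compare against whatever the compiler already generates for a plain loop.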
If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe on AMD devices at least, it moves the line into the "M"odified state in its MOESI protocol directly, rather than fetching it "S"hared and waiting for the "S"hared -> "M"odified upgrade on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)
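A minimal sketch of the prefetch-for-write idea, using the GCC/Clang builtin __builtin_prefetch with its second argument set to 1 (prepare for a write). The helper name, the 64-byte line size, and the prefetch distance of four lines are all assumptions for illustration. Whether the compiler emits an actual PREFETCHW instruction depends on the target and compiler flags, and the whole thing is only a hint to the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     64  /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 4   /* assumed prefetch distance in lines; tune by measurement */

/* Hypothetical helper: fill n words, hinting upcoming lines for write. */
static void fill32_prefetchw(uint32_t *dst, uint32_t value, size_t n)
{
    const size_t words_per_line = CACHE_LINE / sizeof(uint32_t);
    const size_t ahead = PREFETCH_AHEAD * words_per_line;

    assert(dst != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Once per line's worth of words, hint a line further ahead for
           writing. The bounds check keeps the address inside the buffer. */
        if (i % words_per_line == 0 && i + ahead < n)
            __builtin_prefetch(dst + i + ahead, 1, 3);
        dst[i] = value;
    }
}
```

For a simple sequential fill like this, the hardware prefetcher may already do most of the work, so any gain needs to be measured on the target machine.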
Copyright © 2017, Eklektix, Inc.