You'd copy because the whole "copy 4096 bytes" operation is ONE instruction, "rep movsd". And filling is also just one instruction: "rep stosd".
Concerning speed, this is probably bandwidth-limited in the usual case (when the page has been unused long enough to fall out of the cache), so the in-core execution time probably does not matter much. The branch in the looping version should be very predictable. Hmm, I think "rep stosd" is more likely than the looping version to avoid the cache-line reads that write-allocation causes, and that would make a difference when the page is cold. If you want to know for certain, just measure it.
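For anyone who wants to do that measurement, here is a rough sketch in C of the two variants. The names (clear_page_loop, clear_page_rep, time_clear, PAGE_BYTES) are made up for illustration, the 4096-byte page size is assumed, and the inline assembly assumes GCC or Clang on x86 (other targets fall back to memset). Note that "rep stosl" is just the AT&T spelling of "rep stosd".

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096  /* assumed page size */

/* Looping version: one 32-bit store per iteration. */
static void clear_page_loop(void *page)
{
    /* volatile keeps the compiler from replacing the loop with memset */
    volatile uint32_t *p = page;
    for (size_t i = 0; i < PAGE_BYTES / sizeof(uint32_t); i++)
        p[i] = 0;
}

/* Single-instruction version: "rep stosl" stores EAX count times. */
static void clear_page_rep(void *page)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *dst = page;
    size_t count = PAGE_BYTES / 4;
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(count)
                     : "a"(0)
                     : "memory");
#else
    memset(page, 0, PAGE_BYTES);  /* fallback on non-x86 targets */
#endif
}

/* Average nanoseconds per call of clear() over reps calls. */
static double time_clear(void (*clear)(void *), void *page, int reps)
{
    struct timespec t0, t1;
    assert(reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        clear(page);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / reps;
}
```

Calling time_clear(clear_page_loop, buf, n) and time_clear(clear_page_rep, buf, n) on a page-aligned buffer gives the warm-cache numbers only, since the same page is cleared over and over. To see the cold-page effect, you would have to push the page out of the cache between calls (for example by walking a buffer much larger than the last-level cache) and time each call separately.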
About using the DMA engine: I remember (but could not find when I last looked) a posting (by Linus Torvalds, IIRC) from many years ago that compared the Linux approach of clearing on demand with some other OS (BSD?) that cleared pages in the idle process or something similar (where, in theory, it costs nothing). The bottom line (i.e., when measuring application performance) was that the Linux approach was faster, because the page was warm in the cache afterwards, so accesses to it did not incur cache misses. This should still hold, even with clearing by a DMA engine.
Copyright © 2017, Eklektix, Inc.