Uncached memory is *far* slower than CPUs, and cache is precious and
Speeding up the page allocator
Posted Feb 26, 2009 21:46 UTC (Thu) by jzbiciak (subscriber, #5246)
I believe AMD recommended "rep stosd" for filling memory at one time. If you want to go faster still, I imagine there are SSE equivalents that store 128 or 256 bits at a go. (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)
If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe on AMD devices at least, it moves the line into the "O"wner state in its MOESI protocol directly, rather than waiting for the "S"hared -> "O"wner transition on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)
Posted Feb 27, 2009 1:08 UTC (Fri) by jzbiciak (subscriber, #5246)
Here's the MMX and AMD optimized copies and fills the kernel currently uses. I can't imagine they'd settle for a crappy loop here, and it looks like some thought was put into these.
On regular x86, they do indeed use "rep stosl". (I guess the AT&T syntax spells it "stosl" instead of "stosd"?) See around like 92.
Rampant speculation is fun and all, but I suspect Arjan actually measured these. :-) (Or, at least the ones in the MMX file.)
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds