Posted Feb 26, 2009 21:46 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
[Link]
I came here to say pretty much the same thing. Instructions in general are waaaaay faster than memory, so caring about branch predictor performance on an "easy" case (in this case, a long-running memory fill loop) is just silly. Modern CPUs issue multiples of instructions per cycle and still measure run time in cycles per instruction, because memory is slooooooow.
I believe AMD recommended "rep stosd" for filling memory at one time. If you want to go faster still, I imagine there are SSE equivalents that store 128 or 256 bits at a go. (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)
If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe on AMD devices at least, it moves the line into the "O"wner state in its MOESI protocol directly, rather than waiting for the "S"hared -> "O"wner transition on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)
Speeding up the page allocator
Posted Feb 27, 2009 1:08 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)
[Link]
Ah, it appears AMD K7 and beyond go one better and have a "streaming store" that doesn't even do a cache allocate. Nice.
Here's the MMX and AMD optimized copies and fills the kernel currently uses. I can't imagine they'd settle for a crappy loop here, and it looks like some thought was put into these.