Ah, it appears AMD K7 and beyond go one better and have a "streaming store" that doesn't even allocate a cache line for the write. Nice.
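For anyone who wants to poke at the idea from user space, here's a rough sketch of a copy built on streaming (non-temporal) stores. To be clear, this is not the kernel's code: it assumes an x86 compiler with SSE2 and uses the `_mm_stream_si128` intrinsic, and `stream_copy` is just a name I made up. Without SSE2, or with an unaligned destination, it falls back to plain `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper (not from the kernel): copy n bytes, using
 * non-temporal stores when dst is 16-byte aligned and SSE2 is available. */
static void stream_copy(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst & 15) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t blocks = n / 16;

        for (size_t i = 0; i < blocks; i++) {
            /* Normal (possibly unaligned) load from the source... */
            __m128i v = _mm_loadu_si128(s + i);
            /* ...then a store that is hinted to bypass the cache. */
            _mm_stream_si128(d + i, v);
        }
        /* Streaming stores are weakly ordered; fence before anyone
         * else is allowed to rely on the data being visible. */
        _mm_sfence();

        /* Copy the tail (fewer than 16 bytes) the ordinary way. */
        size_t done = blocks * 16;
        memcpy((char *)dst + done, (const char *)src + done, n - done);
        return;
    }
#endif
    /* Fallback: no SSE2 or unaligned destination. */
    memcpy(dst, src, n);
}
```

Whether this actually beats a regular copy depends on the size and on whether the destination gets read again soon, so treat it as something to benchmark rather than a drop-in win.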
Here are the MMX- and AMD-optimized copies and fills the kernel currently uses. I can't imagine they'd settle for a crappy loop here, and it looks like some thought was put into these.