mov 4096, %ecx      ; Number of bytes to zero
@@j0:
mov 0, [addr+%ecx]  ; Write a zero at addr+offset
loop @@j0           ; dec %ecx + jnz @@j0
mov srcaddr, %esi ; Source
mov dstaddr, %edi ; Destination
mov 1024, %ecx ; Number to copy
rep movsd ; Copy 4-byte double-words
This is easily pipelined and handled internally, and it also executes a quarter as many operations. Unfortunately it requires a big sweep through two pages of data; the CPU's cache algorithms should keep this up to date, but will probably evict 8KB of cache in the process.
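For comparison outside the kernel, here is a user-space C sketch of the same contrast (purely illustrative; the function names are made up): an open-coded double-word copy loop versus memcpy(), which on x86 typically ends up in rep movs or SSE moves, i.e., the "internalized" path described above.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_WORDS 1024  /* a 4096-byte page as 4-byte double-words */

/* Open-coded double-word copy: the moral equivalent of the mov/loop
 * version, one load/store plus loop bookkeeping per iteration. */
static void copy_page_loop(uint32_t *dst, const uint32_t *src) {
    for (int i = 0; i < PAGE_WORDS; i++)
        dst[i] = src[i];
}

/* glibc's memcpy usually dispatches to rep movs or wide vector moves,
 * keeping the loop logic inside the CPU/library fast path. */
static void copy_page_fast(uint32_t *dst, const uint32_t *src) {
    memcpy(dst, src, PAGE_WORDS * sizeof(uint32_t));
}
```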
Speeding up the page allocator
Posted Feb 26, 2009 4:28 UTC (Thu) by agrover (subscriber, #55381)
You can use it to zero the page and not touch CPU cache.
Posted Feb 26, 2009 12:09 UTC (Thu) by nix (subscriber, #2304)
Posted Feb 26, 2009 14:35 UTC (Thu) by email@example.com (guest, #38022)
Posted Feb 26, 2009 20:17 UTC (Thu) by bluefoxicy (guest, #25366)
Posted Feb 27, 2009 1:21 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)
While it's true that zeroing all freed pages is a bad idea, keeping a pool of freed pages that's refilled during idle periods isn't so crazy. I believe the Windows NT kernel does something along those lines. You do end up putting more code in the fast-path to detect whether the "prezeroed pool" is non-empty, and it only applies to GFP_ZERO pages anyway, so I suspect it ends up not being a win under Linux.
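The idle-time pool idea can be sketched in miniature (user-space C, purely illustrative; every name here, such as prezeroed_pool, is invented, and real kernels need locking and memory-pressure handling this toy ignores):

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define POOL_MAX  8

/* A toy pre-zeroed page pool: refilled "at idle", drained by
 * GFP_ZERO-style allocations. */
static void *prezeroed_pool[POOL_MAX];
static int pool_count;

/* Called from an idle loop: zero pages ahead of time while nothing
 * better is happening. */
static void pool_refill_at_idle(void) {
    while (pool_count < POOL_MAX) {
        void *page = malloc(PAGE_SIZE);
        if (!page)
            return;
        memset(page, 0, PAGE_SIZE);
        prezeroed_pool[pool_count++] = page;
    }
}

/* Fast path: note the extra check the comment warns about -- "is the
 * pool non-empty?" -- taken on every zeroed allocation. */
static void *alloc_zeroed_page(void) {
    if (pool_count > 0)
        return prezeroed_pool[--pool_count];  /* already zeroed */
    return calloc(1, PAGE_SIZE);              /* zero on demand */
}
```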
Mel's patches bring a noticeable speedup at the benchmark level, and suggest to me that GFP_ZERO pages are not the most numerous allocations in the system. This makes intuitive sense--most allocations back other higher-level allocators in the kernel and/or provide buffer space that's about to be filled. There's no reason to zero it. Complicating those allocations for a minor speedup in GFP_ZERO allocations seems misplaced.
Posted Feb 27, 2009 10:26 UTC (Fri) by firstname.lastname@example.org (guest, #38022)
Posted Feb 27, 2009 23:14 UTC (Fri) by nix (subscriber, #2304)
IIRC the zero page was removed from the kernel because zeroing pages was faster than doing pagetable tricks to share a single zero page. Pagetable manipulation is particularly expensive, but even so...
we have /dev/zero, why not use the hardware implementation?
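In user space the /dev/zero route is easy to demonstrate: mapping it hands back pages the kernel zeroes for you, typically by copy-on-write sharing of a single zero page until the first write. A minimal Linux sketch:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page of /dev/zero; the kernel supplies zeroed memory lazily.
 * Returns NULL on failure. */
static char *map_zero_page(void) {
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : p;
}
```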
Posted Mar 4, 2009 8:03 UTC (Wed) by xoddam (subscriber, #2322)
Posted Feb 26, 2009 14:42 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
If you want to get really fancy on a modern ISAs but not touch DMA engines, you'd use the various prefetch-for-write and streaming write instructions that write 128 bits or more at a go. (I'm not limiting myself to x86 and SSE variants here.)
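Sticking to x86/SSE2 for concreteness, a hedged sketch of the streaming-store approach (assumed names only; the SSE2 path needs a 16-byte-aligned page and falls back to memset elsewhere):

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Zero one 4 KiB page with non-temporal (streaming) stores where SSE2
 * is available, bypassing the cache instead of evicting 4 KiB of it.
 * 'page' must be 16-byte aligned for the SSE2 path. */
static void zero_page_stream(void *page) {
#if defined(__SSE2__)
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < 4096; i += 16)
        _mm_stream_si128((__m128i *)((char *)page + i), zero);
    _mm_sfence();  /* make the streaming stores globally visible */
#else
    memset(page, 0, 4096);  /* portable fallback */
#endif
}
```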
Posted Feb 26, 2009 20:23 UTC (Thu) by bluefoxicy (guest, #25366)
You'd copy because the whole "copy 4096 bytes" operation is ONE instruction, "rep movsd" (or "rep movsb", which is probably internally optimized to operate on words except for non-word-aligned start/end data). The entire loop logic is internalized in the CPU, and there's no stepping through macroinstructions like "cmp," "jnz," "dec," or "loop."
The assumption here is that the CPU's internal microcode for running a loop is a lot faster than stepping through two instructions:
rep movsd ; Copy based on registers %esi, %edi, %ecx
@j00: ; label, not an instruction
mov 0,[addr+%ecx] ; Write 0x00 to addr+offset
loop @j00 ; dec %ecx && jnz @j00
One of these involves branching, and thus branch prediction. One of these involves cache, and thus prefetching... but also works internally. Which is faster?
Posted Feb 26, 2009 20:46 UTC (Thu) by bcopeland (subscriber, #51750)
I admit I haven't kept up with the ISAs since the Pentium era, but for a while the rep functions were in fact slower than open-coded loops. Anyway, if it were true that rep movs was faster than dec/jmp, there is rep stosd, which does the same thing but without copying.
Posted Feb 27, 2009 0:00 UTC (Fri) by bluefoxicy (guest, #25366)
Posted Feb 26, 2009 21:24 UTC (Thu) by nix (subscriber, #2304)
Uncached memory is *far* slower than CPUs, and cache is precious and
Posted Feb 26, 2009 21:46 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
I believe AMD recommended "rep stosd" for filling memory at one time. If you want to go faster still, I imagine there are SSE equivalents that store 128 or 256 bits at a go. (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)
If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe on AMD devices at least, it moves the line into the "O"wner state in its MOESI protocol directly, rather than waiting for the "S"hared -> "O"wner transition on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)
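With GCC or Clang, the prefetch-for-write hint is available portably as __builtin_prefetch (second argument 1 = write intent); whether it maps to PREFETCHW and a direct fill into an owned/exclusive line is up to the hardware. A sketch with an assumed 64-byte line size:

```c
#define PAGE_SIZE 4096
#define LINE 64  /* assumed cache-line size */

/* Zero a page, hinting write intent one cache line ahead so the fill
 * is less likely to stall on the coherence-state transition. */
static void zero_with_prefetchw(unsigned char *p) {
    for (int i = 0; i < PAGE_SIZE; i += LINE) {
        if (i + LINE < PAGE_SIZE)
            __builtin_prefetch(p + i + LINE, 1 /* write */, 0);
        for (int j = 0; j < LINE; j++)
            p[i + j] = 0;
    }
}
```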
Posted Feb 27, 2009 1:08 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)
Here are the MMX- and AMD-optimized copies and fills the kernel currently uses. I can't imagine they'd settle for a crappy loop here, and it looks like some thought went into these.
On regular x86, they do indeed use "rep stosl". (I guess the AT&T syntax spells it "stosl" instead of "stosd"?) See around line 92.
Rampant speculation is fun and all, but I suspect Arjan actually measured these. :-) (Or, at least the ones in the MMX file.)
Posted Feb 26, 2009 22:45 UTC (Thu) by iabervon (subscriber, #722)
Posted Feb 27, 2009 0:01 UTC (Fri) by bluefoxicy (guest, #25366)
As another poster said, rep may or may not be faster than open-coded loops.
Posted Feb 28, 2009 17:53 UTC (Sat) by anton (guest, #25547)
You'd copy because the whole "copy 4096 bytes" instruction is ONE instruction, "rep movsd"
Concerning speed, this stuff is probably bandwidth-limited in the usual case (when the page has cooled down for a while), so the time for the in-core execution probably does not really matter. The branch in the looping version should be very well predictable. Hmm, I think it's more likely that "rep stosd" avoids the write-allocation cache-line reads than the looping version, and that would have an effect with the page being cold. If you want to know for certain, just measure it.
About using the DMA engine, I remember (but could not find last I looked) a posting (by IIRC Linus Torvalds) many years ago that compared the Linux approach of clearing on-demand with some other OS (BSD?) that cleared pages in the idle process or something (where it costs nothing in theory). In the bottom line (i.e., when measuring application performance) the Linux approach was faster, because the page was warm in the cache afterwards, and accesses to the page did not incur cache misses. This should still hold, even with clearing by a DMA engine.
Posted Mar 5, 2009 8:18 UTC (Thu) by efexis (guest, #26355)
Posted Mar 5, 2009 8:37 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds