<i>The eliminated instruction is one byte long, executes very quickly, and string instructions
are not very common in most real code anyway. When they occur, they are heavyweight
operations, because the sources and count have to be set up into particular registers, and the
string instruction itself usually takes much more time than simple instructions. Whether or
not the direction flag instruction appears might then change the time of the string operation
by perhaps 1% or less. So except for contrived programs that consist almost entirely of these
string operations, I suspect it is impossible to measure any execution time reduction in
actual programs that could be attributed to this compiler change.</i>
Unfortunately, you're wrong. CLD can have a latency of 50+ cycles on some x86 implementations,
which is not an insignificant amount. Plus we're not just talking about explicit "string
operations": functions like memset() and memcpy() are often implemented with string instructions too.
See http://gcc.gnu.org/ml/gcc/2008-03/msg00360.html for some benchmarks,
and http://gcc.gnu.org/ml/gcc/2008-03/msg00404.html for a link to a document that gives a
latency of 52 cycles for CLD.
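
To put rough numbers on the "1% or less" estimate, here is a back-of-the-envelope sketch in C. It takes the 52-cycle CLD figure from the document linked above; the helper name `cld_share_permille` and the 100-cycle cost used for a small memset() in the usage note are purely illustrative assumptions, not measurements.

```c
/*
 * Share of the total cost that goes to CLD, in tenths of a percent.
 * cld_cycles: assumed latency of the CLD instruction.
 * op_cycles:  assumed cost of everything else (register setup plus
 *             the string instruction itself).
 * Integer arithmetic only, so the result is rounded down.
 */
static unsigned cld_share_permille(unsigned cld_cycles, unsigned op_cycles)
{
    unsigned total = cld_cycles + op_cycles;

    if (total == 0)
        return 0; /* nothing executed, so no share to report */

    return (1000u * cld_cycles) / total;
}
```

With these assumptions, `cld_share_permille(52, 100)` gives 342, i.e. about a third of the time for a hypothetical 100-cycle memset(). For CLD to account for only 1% of the total, the rest of the operation would have to cost 5148 cycles: `cld_share_permille(52, 5148)` gives 10.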