I saw a similar issue with prefetch and branch prediction back when I was doing a lot of SPARC work.
Branch prediction gave us a bit of extra performance with a few code bases, but the older and better the code, the less we saw. My favorite example is Samba, where a Smarter Colleague[tm] and I looked at what was actually happening. Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter whether we set the prediction to correctly say we'd branch around them. The branch target was very often far enough away that we landed on a different i-cache line. Since we didn't have a way of hinting which line we'd hit, we'd slow down whenever the code wasn't trivial-to-predict straight-line code.
The better and older the code, the less we would get the next i-cache line sitting waiting for us, and the slower we'd run. Grungy straight-line FORTRAN benefited fine.
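To make the i-cache point concrete, here's a rough sketch in C of the arithmetic involved. It is not what we ran at the time. The 32-byte line size, the function names and the addresses in the usage note are all made up for illustration. It only checks whether a taken branch lands in a different i-cache line than the branch instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical i-cache line size in bytes; real parts differ. */
enum { ICACHE_LINE_BYTES = 32 };

/* Index of the i-cache line holding a given instruction address.
 * line_bytes must be a nonzero power of two. */
static uint64_t icache_line_of(uint64_t addr, uint64_t line_bytes)
{
    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);
    return addr / line_bytes;
}

/* Nonzero when a taken branch lands in a different i-cache line
 * than the one holding the branch instruction itself. */
static int branch_leaves_line(uint64_t branch_addr, uint64_t target_addr,
                              uint64_t line_bytes)
{
    return icache_line_of(branch_addr, line_bytes)
        != icache_line_of(target_addr, line_bytes);
}
```

With those assumed numbers, a branch at 0x1008 that skips 8 bytes to 0x1010 stays in the same line, while one that skips a 64-byte debug macro expansion to 0x1048 does not. Predicting the second branch correctly tells you it's taken, but says nothing about whether the target line is already sitting in the cache.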
I don't recollect ever seeing an actual slowdown, but we rarely could see the predicted benefits from branch prediction.
I'd venture as much as a five-cent bet we'll see the same with the Intel architecture.