Speaking as a compiler developer, the stated problems look like possible performance bugs to me. There are always tradeoffs in compiler transformations. You hope to speed one thing up without hurting other stuff too much. Most of it is heuristic-based, and changing the heuristics can have dramatic effects. I have seen register allocators swing performance by ±20% simply by changing the heuristic for picking which object to allocate next.
Incidentally, I have run into exactly the same unrolling/icache issue before. It's one of those 2nd- or 3rd-order effects you hope don't matter, but when one does it can be a fun time tracking it down. The main problem is that it is highly context-sensitive. Unrolling a loop by 4 may be exactly right in one place but a disaster on another loop. This is one reason compilers try to do a global analysis when possible. Of course, that's another tradeoff, one between generated code quality and compile time.
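For anyone who hasn't looked at what unrolling by 4 does to the code, here is a rough hand-written sketch in C. The function names `sum_rolled` and `sum_unrolled4` are made up for illustration, and this is not what gcc literally emits; it only shows the shape of the transformation.

```c
#include <stddef.h>

/* Original loop: one add and one loop branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same loop unrolled by 4 by hand (hypothetical illustration).
 * The main loop takes one branch per four elements, but its body is
 * four times as large, and a cleanup loop is needed for the leftover
 * 0 to 3 elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* Main unrolled body: handles four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop: handles whatever did not fit in a group of four. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

The unrolled version does less branching per element but takes up more instruction bytes. Whether that is a win depends on what else is competing for the icache around that loop, which is exactly the context sensitivity I mean.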
The last thing we need is another compiler mode. We have too many already. The real solution is to re-tune some of gcc's passes for modern architectures. That is real work, though, and takes a ton of time and patience.