> can't see how adding more registers could ever legitimately make a
> CPU-bound 32-bit program *slower*...?
Several things could conspire to make this happen, besides the lack of optimization already noted.
- Function calls are more expensive due to additional callee-save registers.
- System calls are more expensive due to larger context save and restore.
- Things like setjmp/longjmp are slower for the same reason.
- Longer instruction encoding causes icache pressure.
Then there are all sorts of microarchitectural changes resulting from the ISA additions that can reduce clock-for-clock performance, such as longer pipelines to compensate for more complicated instruction decoding, though these are likely secondary at best.