The thing is that micro-optimizations don't always improve the overall performance of the system; they may make one function faster at the cost of leaving the cache slightly less useful. A benchmark focused on that one function will show an improvement, even though the change causes some other concurrent task to take extra cache misses in its main loop. Furthermore, squeezing the last 0.5% out of one architecture (or micro-architecture, or the code produced by one compiler version) may make the code much less efficient on others.
I think the main trade-off is not about micro-optimizations, though, but rather between code that simply does what needs to be done and code that maintains data structures informing other code of what state things are in. The old driver code, for example, set up the device, performed whatever operation was requested, and kept a small amount of state about it; it requested an interrupt from the device and handled that interrupt when it arrived. The new code tracks the power-management state of the device, which pending operations would prevent powering it off, what other hardware must be kept powered to reach the device, and so forth. That is a lot of bloat in the form of data that has to be stored and updated; it is also necessary for power management to have any chance of working.