Sometimes you want to go beyond algorithmic optimization and find out whether, and how, a particular piece of code could run faster. There can be many reasons why the code is not yet optimal: it could be causing frequent cache misses or TLB misses, branch prediction might be working poorly, and so on. But without the hardware telling you exactly what is happening, all you can do is guess.
It's an easier game when the hardware helps you. That is why modern processors can be programmed to count performance-related events such as cache misses, TLB misses, or branch mispredictions. Processors can also be programmed to generate an interrupt when a counter reaches a certain threshold (i.e. when it "overflows"); at that point, the operating system can record exactly which piece of code was running when the event occurred. Over time, you accumulate statistics telling you how often a particular piece of code runs into one of these performance problems.
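The overflow mechanism can be illustrated with a toy simulation. This is only a sketch of the idea, not how any real processor or operating system implements it: the function name `sample_on_overflow`, the region names `parse` and `render`, and the threshold value are all made up for illustration.

```python
from collections import Counter

def sample_on_overflow(event_stream, threshold):
    """Simulate overflow-based sampling (hypothetical helper).

    event_stream yields the name of the code region that was running
    each time the monitored event (e.g. a cache miss) occurred.
    """
    samples = Counter()
    count = 0
    for region in event_stream:
        count += 1                # the hardware counter ticks on every event
        if count == threshold:    # the counter "overflows": interrupt fires
            samples[region] += 1  # the OS records what was running
            count = 0             # the counter is re-armed
    return samples

# Assumed workload: out of every 10 events, 9 happen in 'parse'
# and 1 happens in 'render'.
events = (["parse"] * 9 + ["render"]) * 100

# A threshold that shares no factor with the period of the workload
# avoids always sampling the same position in the pattern.
samples = sample_on_overflow(events, threshold=7)

for region, hits in samples.most_common():
    print(region, hits)
```

Only one event in seven is recorded, yet the samples split between the two regions in roughly the same 9-to-1 proportion as the events themselves. That is what makes sampling useful as a cheap statistical profile.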
Given these statistics, you can make a more educated guess as to how your code could be improved (e.g. by rearranging a data structure to reduce cache misses).
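To give a rough feel for why rearranging a structure can help, the sketch below counts how many distinct cache lines a loop would touch under two layouts. It is a deliberately simple model. It assumes a 64-byte cache line, a hypothetical 64-byte record containing one 8-byte "hot" field, and a loop that reads only that field. It does not model associativity, prefetching, or eviction.

```python
CACHE_LINE = 64   # bytes; assumed cache line size

def lines_touched(addresses, line_size=CACHE_LINE):
    """Count the distinct cache lines covered by the given byte addresses."""
    return len({addr // line_size for addr in addresses})

N = 1024          # number of records (arbitrary)
RECORD_SIZE = 64  # hypothetical record: 8-byte hot field + 56 bytes of cold data
FIELD_SIZE = 8    # size of the hot field alone

# Layout 1: the hot field is embedded in each record, so consecutive
# hot fields are RECORD_SIZE bytes apart.
embedded = [i * RECORD_SIZE for i in range(N)]

# Layout 2: the hot fields are moved into their own packed array, so
# consecutive hot fields are FIELD_SIZE bytes apart.
packed = [i * FIELD_SIZE for i in range(N)]

embedded_lines = lines_touched(embedded)
packed_lines = lines_touched(packed)
print(embedded_lines, packed_lines)
```

In this model the packed layout touches one eighth as many cache lines for the same loop. Whether a real program benefits to that degree is exactly the kind of question the counter statistics are meant to answer.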