It's an easier game when the hardware helps you. That's why modern processors can be programmed to count performance-related events such as cache misses, TLB misses, or branch mispredictions. They can also be programmed to raise an interrupt when a counter reaches a certain threshold (i.e. when it "overflows"); at that point, the operating system can record exactly which piece of code was running when the event occurred. Over time, you thus accumulate statistics telling you how often a particular piece of code runs into one of these performance problems.
Given these statistics, you can make a more educated guess as to how your code could be improved (e.g. rearranging a structure to reduce cache misses).
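To make the overflow mechanism concrete, here is a toy simulation in Python. It is only a sketch of the idea, not how any real kernel or profiler does it: the function name sample_on_overflow, the code locations "parse", "lookup" and "copy", and the 1:7:2 miss ratio are all invented for illustration. Each entry in the trace stands for one cache miss, tagged with the code that caused it; every time the simulated counter hits the threshold, the "interrupt handler" records where execution was.

```python
import random
from collections import Counter


def sample_on_overflow(miss_trace, threshold):
    """Return a histogram of the locations seen at each counter overflow.

    miss_trace: iterable of code locations, one entry per counted event.
    threshold:  number of events after which the counter "overflows".
    """
    samples = Counter()
    counter = 0
    for location in miss_trace:
        counter += 1                  # the hardware counts the event
        if counter == threshold:      # overflow: the interrupt fires
            samples[location] += 1    # the OS records what was running
            counter = 0               # the counter is re-armed
    return samples


# Hypothetical workload: "lookup" causes most of the cache misses.
rng = random.Random(0)
miss_trace = rng.choices(["parse", "lookup", "copy"], weights=[1, 7, 2], k=100_000)

profile = sample_on_overflow(miss_trace, threshold=100)
for location, hits in profile.most_common():
    print(location, hits)
```

Only one event in a hundred is recorded, yet the histogram still points at the location responsible for most of the misses. Note that a fixed threshold can fall into step with a loop that repeats with the same period, so the same location gets sampled every time; this sketch ignores that problem.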
A classic paper from Digital (1997) explains how they implemented this kind of profiling on their Alpha platforms:
The "batches" mentioned in the article relate to the number of performance registers (counters) that can be read in one shot.