Dr. Weaver points out perfctr's low overhead and latency, and Mr. Gleixner counters by bringing up perfmon. They're different beasts. The former, perfctr, worked wonderfully for us users over many years and still has a decent user base (possibly the largest if you count by node rather than by user).
The poor PAPI folks are trying their best to keep us users productive, but I'm still running into problems that never existed before, like an artificially limited number of counters.
But, hey, NIH, "there are no users," and all that.