LWN.net Logo

Raw events and the perf ABI

By Jonathan Corbet
May 3, 2011
The perf events subsystem often looks like it's on the path to take over the kernel; there is a great deal of development activity there, and it has become a sort of generalized event reporting mechanism. But the original purpose of perf events was to provide access to the performance monitoring counters made available by the hardware, and it is still used to that end. The merging of perf was a bit of a hard pill for users of alternative performance monitoring tools to swallow, but they have mostly done so. The recent discussion on "offcore" events shows that there are still some things to argue about in this area, even if everybody seems likely to get what they want in the end.

The performance monitoring unit (PMU) is normally associated with the CPU; each processor has its own PMU for monitoring its own specific events. Some newer processors (such as Intel's Nehalem series) also provide a PMU which is not tied to any CPU; in the Nehalem case it's part of the "uncore" which handles memory access at the package level. The off-core PMU has a viewpoint which allows it to provide a better picture of the overall memory behavior of the system, so there is interest in gaining access to events from that PMU. Current kernels, though, do not provide access to these offcore events.

For a while, the 2.6.39-rc kernel did provide access to these events, following the merging of a patch by Andi Kleen in March. One piece that was missing, though, was a patch to the user-space perf tool to provide access to this functionality. There was an attempt to merge that piece toward the end of April, but it did not yield the desired results; rather than merge the additional change, perf maintainer Ingo Molnar removed the ability to access offcore events entirely.

Needless to say, that action has led to some unhappiness in the perf user community; there are developers who had already been making use of those events. Normally, breaking things in this way would be considered a regression, and the patch would be backed out again. But, since this functionality never appeared in a released kernel, it cannot really be called a regression. That, of course, is part of the point of removing the feature now.

Ingo's complaint is straightforward: the interface to these events was too low-level and too difficult to use. The rejected perf patch had an example command which looked like:

    perf stat -e r1b7:20ff -a sleep 1

Non-expert readers may, indeed, be forgiven for not immediately understanding that this command would monitor access to remote DRAM - memory which is hosted on a different socket. Ingo asserted that the feature should be more easily used, perhaps with a command like (from the patch removing the feature):

    perf record -e dram-remote ./myapp

He also said:

But this kind of usability is absolutely unacceptable - users should not be expected to type in magic, CPU and model specific incantations to get access to useful hardware functionality.

The proper solution is to expose useful offcore functionality via generalized events - that way users do not have to care which specific CPU model they are using, they can use the conceptual event and not some model specific quirky hexa number.

The key is the call for "generalized events" which are mapped, within the kernel, onto whatever counters the currently-running hardware uses to obtain that information. Users need not worry about the exact type of processor they are running on, and they need not dig out the data sheet to figure out what numbers will yield the results they want.

Criticism of this move has taken a few forms. Generalized events, it is said, are a fine thing to have, but they can never reflect all of the weird, hardware-specific counters that each processor may provide. These events should also be managed in user space where there is more flexibility and no need to bloat the kernel. There were some complaints about how some of the existing generalized events have not always been implemented correctly on all architectures. And, they say, there will always be people who want to know what's in a specific hardware counter without having the kernel trying to generalize it away from them. As Vince Weaver put it:

Blocking access to raw events is the wrong idea. If anything, the whole "generic events" thing in the kernel should be ditched. Wrong events are used at times (see AMD branch events a few releases back, now Nehalem cache events). This all belongs in userspace, as was pointed out at the start. The kernel has no business telling users which perf events are interesting, or limiting them!

Ingo's response is that the knowledge and techniques around performance monitoring should be concentrated in one place:

Well, the thing is, i think users are helped most if we add useful, highlevel PMU features added and not just an opaque raw event pass-through engine. The problem with lowlevel raw ABIs is that the tool space fragments into a zillion small hacks and there's no good concentration of know-how. I'd like the art of performance measurement to be generalized out, as well as it can be.

Vince, meanwhile, went on to claim that perf was a reinvention of the wheel which has ignored a lot of the experience built into its predecessors. There are, it seems, still some scars from that series of events. Thomas Gleixner disagreed with the claim that perf is an exercise in wheel reinvention, but he did say that he thought the raw events should be made available:

The problem at hand which ignited this flame war is definitely borderline and I don't agree with Ingo that it should not made be available right now in the raw form. That's an hardware enablement feature which can be useful even if tools/perf has not support for it and we have no generalized event for it. That's two different stories. perf has always allowed to use raw events and I don't see a reason why we should not do that in this case if it enables a subset of the perf userbase to make use of it.

It turns out that Ingo is fine with raw events too. His stated concern is that access to raw events should not be the primary means by which most users gain access to those performance counters. So he is blocking the availability of those events for now for two reasons. One of those is that he wants the generalized mode of access to be available first so that users will see it as the normal way to access offcore events. If there is never any need to figure out hexadecimal incantations, many user-space developers will never bother; as a result, their commands and code should eventually work on other processors as well.

The other reason for blocking raw events now is that, as the interface to these events is thought through, the ABI by which they are exposed to user space may need to change. Releasing the initial ABI in a stable kernel seems almost certain to cement it in place, given that people were already using it. By deferring these events for one cycle (somebody will certainly come up with a way to export them in 2.6.40), he hopes to avoid being stuck with a second-rate interface which has to be supported forever.

There can be no doubt that managing this feature in this way makes life harder for some developers. The kernel process can be obnoxious to deal with at times. But the hope is that doing things this way will lead to a kernel that everybody is happier with five years from now. If things work out that way, most of us can deal with a bit of frustration and a one-cycle delay now.


(Log in to post comments)

Raw events and the perf ABI

Posted May 5, 2011 13:44 UTC (Thu) by ejr (subscriber, #51652) [Link]

It's worth noting that Dr. Weaver points out perfctr's low overhead and latency, and Mr. Gleixner counters by bringing up perfmon. They're different beasts. The former worked wonderfully for us users over many years and still has a decent user base (possibly the largest if you count by node rather than by user).

The poor PAPI folks are trying their best to keep us users productive, but I'm still running into problems that never existed before like artificially limited numbers of counters.

But, hey, NIH, "there are no users," and all that.

Raw events and the perf ABI

Posted May 5, 2011 13:47 UTC (Thu) by ejr (subscriber, #51652) [Link]

AUGH! And now I find this: http://article.gmane.org/gmane.linux.kernel/1132912

Well, at least now I know why some of my performance results have been so utterly mystifying. Wow. Think I'll go back to perfctr as best I can.

Raw events and the perf ABI

Posted May 5, 2011 18:52 UTC (Thu) by deater (subscriber, #11746) [Link]

To be fair, PAPI changes predefined event definitions from time to time too.

One problem with the perf_events solution is that as far as I know it is not possible to find out what their "generalized events" map to without obtaining and inspecting the kernel source (and all bets are off if the source is modified and you don't know how). At least in PAPI there's a way from the interface to find out what native events underly the predefined ones.

A lot of the user/kernel issue comes down to how you use your machine. If you are a kernel developer who reboots into new kernels hourly, then putting stuff like this in the kernel is best. However if you are a user at a university or on a supercomputer where older distro kernels are used (and the kernel is updated rarely)... then you are out of luck.

For example, the new perf_events PERF_COUNT_HW_STALLED_CYCLES generalized event won't be available until possibly 2.6.40. You cannot use it unless you run that kernel or newer. And if you compile code against newer headers that have that event available and try to run it, say, on a 2.6.32 kernel... it will fail.

Wheras if you use PAPI which does the generalizing of counters in userspace, there is no backward compatability issue. The event comes with the library; a library that a regular user can install in their home directory. As long as the kernel being run provides a raw event interface for the CPU you have, it doesn't matter how old the kernel is. In fact PAPI predefined events are in theory compatible across kernel versions as well as across different _operating systems_.

Raw events and the perf ABI

Posted May 5, 2011 20:19 UTC (Thu) by ejr (subscriber, #51652) [Link]

Exactly. It's about a month turn-around between possible kernel-level fixes on the relevant small-ish machine here. Other places (e.g. NERSC & ORNL), well, forget it. Rebooting a 19k node machine isn't taken lightly. Not supporting raw access to the different levels makes using performance counters *more* difficult, not less. And calling out-of-tree userspace interfaces non-existent is dishonest.

Raw events and the perf ABI

Posted May 5, 2011 18:31 UTC (Thu) by deater (subscriber, #11746) [Link]

Good article. One subtle thing to note is that "offcore" and "uncore" events, while they might sound related, are very different in practice.

Offcore Response events let you gather counts for cache results that have gone "offcore". There is a separate MSR (outside the normal MSR range) that you can set that will filter these cache results based on exactly where the offcore cache request was handled.

Much of the debate referenced in this article was sparked because this extra MSR was newly exposed via the "config1" field in the perf_event_attr structure in the perf_events.h header file. (It still is, just now your program will fail if you try to program that field).

Uncore events are events handled by a separate PMU that handles chip-wide (rather than per-core) counters. This includes things like the memory controller, interconnect logic, and the last-level cache. To support this properly the kernel will have to be patched in much more intrusive ways (to support having multiple PMUs active at once, and to also figure out complicated issues. Such as: an uncore-PMU counteroverflowed. Which of the cores gets notified? Which cores is it even relevant too? A tricky problem).

As per Ingo's complaint that you have to pass dense hex values to perf... that's a failing with perf. You can use the very nice "libpfm4" library to use nice names for these events as documented in the various vendor manuals. There have been patches to have perf link against libpfm4, but it doesn't seem to have gone anywhere.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds