
Dueling performance monitors

By Jonathan Corbet
December 9, 2008
Low-level optimization of performance-critical code can be a challenging task. At this point, one assumes, the potential for algorithmic improvements in the targeted code has been realized; what is left is trying to locate and address problems like cache misses, mis-predicted branches, and so on. Such problems can be impossible to find by just looking at the code; one needs support from the hardware. The good news is that contemporary hardware provides that support; most processors can collect a wide range of performance data for analysis. The bad news is that, despite the fact that processors have been able to collect that data for many years, there has never been support for this kind of performance monitoring in the mainline kernel. That situation may be about to change, but, first, the development community will have to make a choice between a venerable out-of-tree implementation and an unexpected competitor.

The "perfmon" patch set has been under development for some years, but, for a number of reasons, it has never found its way into the mainline kernel. The most recent version of the patch was posted for review by Stéphane Eranian in late November. The perfmon patches show the signs of all those years of development work and usage experience; they offer a wide set of features and extensive user-space support. The full perfmon patch adds twelve system calls to the kernel; the posted version, though, trims that count back to five in the hope that a narrower interface will have a better chance of getting into the mainline. The additional system calls, one assumes, will be proposed for inclusion sometime after the perfmon core is merged. The reduced interface is described in the patch set; briefly, an application hooks into the performance monitoring subsystem with a call to:

    int pfm_create(int flags, pfarg_sinfo_t *regs);

This system call returns a file descriptor to identify the performance monitoring session. The regs parameter is used to return a list of performance monitoring registers available on the current system; flags is currently unused.

Specific performance counter registers can be manipulated with:

    int pfm_write(int fd, int flags, int type, void *d, size_t sz);
    int pfm_read(int fd, int flags, int type, void *d, size_t sz);

These system calls can be used to write values into registers (thus programming the performance monitoring hardware) and to read counter and configuration information from those registers.

Actually doing some performance monitoring requires a couple more calls:

    int pfm_attach(int fd, int flags, int target);
    int pfm_set_state(int fd, int flags, int state);

A call to pfm_attach() specifies which process is to be monitored; pfm_set_state() then turns monitoring on and off.
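
Putting the pieces together, a minimal monitoring session might look something like the sketch below. To be clear, the constants, the register-descriptor structure, and the event encoding are placeholders invented for illustration; the real definitions come with the perfmon patch and, as described next, applications would normally go through a user-space library rather than making these calls directly.

    /*
     * Hypothetical sketch only: the type/state constants, the register
     * descriptor, and the event encoding are invented placeholders, and
     * the prototypes below are simplified versions of the patch's own.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* the five system calls from the posted patch (prototypes simplified) */
    int pfm_create(int flags, void *regs);
    int pfm_write(int fd, int flags, int type, void *d, size_t sz);
    int pfm_read(int fd, int flags, int type, void *d, size_t sz);
    int pfm_attach(int fd, int flags, int target);
    int pfm_set_state(int fd, int flags, int state);

    #define TYPE_CONFIG_REG   1        /* placeholder: "write an event selector" */
    #define TYPE_COUNTER_REG  2        /* placeholder: "a data/counter register" */
    #define STATE_START       1        /* placeholder */
    #define STATE_STOP        0        /* placeholder */
    #define RAW_EVENT_CODE    0x412e   /* placeholder CPU-specific event encoding */

    struct reg_value {                 /* placeholder register descriptor */
        uint16_t num;                  /* which register */
        uint64_t value;                /* value to program, or counter read back */
    };

    static int count_events(pid_t target)
    {
        struct reg_value config  = { .num = 0, .value = RAW_EVENT_CODE };
        struct reg_value counter = { .num = 0, .value = 0 };

        /* open a session; ignore the available-register info in this sketch */
        int fd = pfm_create(0, NULL);
        if (fd < 0)
            return -1;

        /* program the event selector and zero the counter */
        pfm_write(fd, 0, TYPE_CONFIG_REG, &config, sizeof(config));
        pfm_write(fd, 0, TYPE_COUNTER_REG, &counter, sizeof(counter));

        pfm_attach(fd, 0, target);           /* bind the session to a process */
        pfm_set_state(fd, 0, STATE_START);   /* start counting */

        sleep(1);                            /* let the target run for a while */

        pfm_set_state(fd, 0, STATE_STOP);    /* stop, then read the result */
        pfm_read(fd, 0, TYPE_COUNTER_REG, &counter, sizeof(counter));
        printf("events counted: %llu\n", (unsigned long long)counter.value);

        close(fd);
        return 0;
    }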

There are a couple of distinctive aspects to the perfmon interface. One is that it knows almost nothing about the specific performance monitoring registers; that information, instead, is expected to live in user space. As a result, the bare perfmon system call interface is probably not something that most monitoring applications would use; instead, those system calls are hidden behind a user-space library which knows how to program different types of processors for the desired results. Beyond that, perfmon uses the ptrace() mechanism to stop the monitored process while performance counters are being queried; as a result, the monitoring process must have the right to trace the target process.

On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new performance counter subsystem. The announcement states:

We are aware of the perfmon3 patchset that has been submitted to lkml recently. Our patchset tries to achieve a similar end result, with a fundamentally different (and we believe, superior :-) design.

This is not the first time that these developers have shown up with an out-of-the-blue reimplementation of somebody else's subsystem; other examples include the CFS scheduler, high-resolution timers, dynamic tick, and realtime preemption. Most of the time, the new code quickly supplants the older version - an occurrence which is not always pleasing to the original developers - but the situation does not seem quite as straightforward this time.

The proposed interface is much simpler, adding a single system call:

    int perf_counter_open(u32 hw_event_type, u32 hw_event_period,
                          u32 record_type, pid_t pid, int cpu);

This call will return a file descriptor corresponding to a single hardware counter. A call to read() will then return the current value of the counter. The hw_event_period can be used to block reads until the counter overflows the given value, allowing, for example, events to be queried in batches of 1000. The pid parameter can be used to target a specific process, and cpu can restrict monitoring to a specific processor.
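
In its simplest form, using a counter is just an open followed by a read. The sketch below shows the idea; the syscall number and the event-type value are placeholders standing in for whatever the posted patch actually defines.

    /*
     * Sketch only: the syscall number and the event-type value are
     * placeholders, not the numbers assigned by the posted patch.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define __NR_perf_counter_open  333   /* placeholder syscall number */
    #define HW_EVENT_CACHE_MISSES   3     /* placeholder event type */

    static int perf_counter_open(uint32_t hw_event_type, uint32_t hw_event_period,
                                 uint32_t record_type, pid_t pid, int cpu)
    {
        return syscall(__NR_perf_counter_open, hw_event_type, hw_event_period,
                       record_type, pid, cpu);
    }

    int main(int argc, char **argv)
    {
        uint64_t count;
        pid_t target;

        if (argc < 2)
            return 1;
        target = atoi(argv[1]);          /* process to monitor */

        /* one counter: cache misses in "target", on whatever CPU it runs (-1) */
        int fd = perf_counter_open(HW_EVENT_CACHE_MISSES, 0, 0, target, -1);
        if (fd < 0) {
            perror("perf_counter_open");
            return 1;
        }

        sleep(1);                        /* let the target run for a while */

        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("cache misses: %llu\n", (unsigned long long)count);

        close(fd);
        return 0;
    }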

There are a few advantages claimed for the new implementation. The simplicity of the system call interface is one of those; it is possible to write a very simple application to perform monitoring tasks, with no additional libraries required. The second version of the patch includes a simple "kerneltop" utility which can display a constantly-updated profile of anything the performance counting hardware can monitor. Another advantage is the avoidance of ptrace(); this reduces the amount of privilege needed by the monitoring process and avoids perturbing the monitored process by stopping and restarting it. The management of counters is said to be more flexible, with facilities for sharing counters between processes and reserving them for administrative access. The low-level hardware interface is said to be simpler as well.

Those claimed advantages notwithstanding, a number of complaints have been raised with regard to the new performance monitoring code. Two of those seem to be at the top of the list: the single counter per file descriptor API, and programming the hardware performance monitoring unit inside the kernel. On the API side, the biggest concern is that putting each counter behind its own file descriptor makes it very hard to correlate two or more counters. Reading two counters requires two independent read() system calls; as is always the case, just about anything could happen between those two calls. So it's hard to tell how two different counter values relate to each other. But that sort of correlation is exactly what developers doing performance optimization want to do. Paul Mackerras says:

Your API has as its central abstraction the "counter". I am saying that that is the wrong abstraction. The abstraction really needs to be a set of counters that are all active over precisely the same interval, so that their values can be meaningfully compared and related to each other.

In response, Ingo argues that the loss of precision caused by independent read() calls is small - much smaller than the muddying of the results caused by stopping the target process so that all of the counters can be read at the same time. That argument does not appear to have convinced the detractors, though.
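
To see the objection in concrete terms: measuring something as basic as instructions per cycle with this interface requires two independently-opened counters and two separate read() calls, and the monitored task is free to keep running between them. The file descriptors in this fragment would be obtained with two perf_counter_open() calls, one per event, as in the sketch above.

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Two independently-opened counters, two separate read() calls:
     * anything the monitored task does between the two reads lands in
     * one value but not the other.
     */
    static void report_ipc(int cycles_fd, int instructions_fd)
    {
        uint64_t cycles = 0, instructions = 0;

        read(cycles_fd, &cycles, sizeof(cycles));
        /* ... the target keeps running between these two calls ... */
        read(instructions_fd, &instructions, sizeof(instructions));

        if (cycles)
            printf("IPC ~= %.2f (skewed by whatever ran between the reads)\n",
                   (double)instructions / cycles);
    }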

The other complaint is that moving the counter programming task into the kernel requires that the kernel know about the complexities of every possible performance monitoring unit it may encounter. This hardware sits at the core of the most performance-critical CPU subsystems, so its designers value non-interference above features or a straightforward programming interface. So programming it can be a complex business, involving sizeable tables describing how various operations interact with each other. The perfmon code keeps those tables in a user-space library, but the alternative implementation won't allow that. Quoting Paul again:

Now, the tables in perfmon's user-land libpfm that describe the mapping from abstract events to event-selector values and the constraints on what events can be counted together come to nearly 29,000 lines of code just for the IBM 64-bit powerpc processors.

Your API condemns us to adding all that bloat to the kernel, plus the code to use those tables.

Paul (and others) argue that this information - which can add up to hundreds of kilobytes - is better kept in user space.

There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was posted for review. It must, indeed, be a shock to work on a subsystem for years, then find a proposed replacement sitting in one's mailbox. As David Miller put it:

And also, another part of the backlash is that the poor perfmon3 person was completely blindsided by this new stuff. Which to be honest was pretty unfair. He might have had great ideas about the requirements (even if you don't give a crap about his approach to achieving those requirements) and thus could have helped avoid the past few days of churn.

So, at this point, what will happen with performance monitoring is unclear at best. Perhaps, though, this discussion will have the effect of raising the profile of performance monitoring, which has been without proper kernel support for many years. The merging of either solution - or, perhaps, a combination of both - seems like it has to be an improvement over having no support at all.

Index entries for this article
Kernel: Performance monitoring



perfctr?

Posted Dec 9, 2008 18:38 UTC (Tue) by SiliconSlick (guest, #39955) [Link] (2 responses)

I'm assuming Mikael Pettersson's "perfctr" wasn't mentioned because it is unlikely to make it into the mainline kernel in the near future. Is that the case?

http://user.it.uu.se/~mikpe/linux/perfctr/

It's still a very useful performance monitoring API, and it is used quite extensively by UT Knoxville's Performance API (PAPI), even if it never makes the big time.

perfctr?

Posted Dec 9, 2008 18:48 UTC (Tue) by BrucePerens (guest, #2510) [Link] (1 responses)

You mean the code in http://user.it.uu.se/~mikpe/linux/perfctr/2.7/UNSUPPORTED/BUG_REPORTS_FOR_ANYTHING_BUT_PPC64_WILL_BE_IGNORED/?

:-)

perfctr?

Posted Dec 9, 2008 19:10 UTC (Tue) by SiliconSlick (guest, #39955) [Link]

2.6 is his x86 release... 2.7 is PPC only... his link for "current" points to 2.6. ;)

Dueling performance monitors

Posted Dec 9, 2008 19:33 UTC (Tue) by foom (subscriber, #14868) [Link] (4 responses)

I just hope *something* manages to make it into the upstream kernel. It's really a huge PITA to have
to patch something into your kernel, just to do performance testing.

FWIW, last time I tried, perfmon2 seemed to be way too slow to be usable for fine-grained testing,
so I'm not too unhappy if it doesn't win. Perfctr w/PAPI seems the way to go at the moment.

Dueling performance monitors

Posted Dec 9, 2008 22:32 UTC (Tue) by deater (subscriber, #11746) [Link] (3 responses)

In what way is perfmon2 slow?

I find it easier to use, and it generates much better results than the alternatives.

Though I agree, as long as *some* performance monitor implementation gets merged I'll be happy. I'd prefer that it were perfmon, as I've put a lot of work into getting MIPS R10k, SPARC, Pentium Pro, Pentium II, and Athlon working, and it would be a shame to have to bring yet another performance counter infrastructure up to speed on all of the platforms I need.

Dueling performance monitors

Posted Dec 10, 2008 14:52 UTC (Wed) by jreiser (subscriber, #11027) [Link] (2 responses)

Both the Eranian scheme (perfmon) and the Gleixner-Molnar scheme require a system call for each read of a counter from user mode. In contrast, the Pettersson scheme (perfctr) does not require a system call for each read of a counter from user mode on x86. Perfctr usually has much less overhead, so it is easier to obtain very fine-grained measurements.

Dueling performance monitors

Posted Dec 10, 2008 15:59 UTC (Wed) by deater (subscriber, #11746) [Link] (1 responses)

How fine-grained are you looking? If you are getting so fine-grained that an extra syscall makes a difference, then you are probably starting to run into issues with skid. Not to mention if you are trying to read multiple counters at the exact same time.

Dueling performance monitors

Posted Dec 10, 2008 21:02 UTC (Wed) by jreiser (subscriber, #11027) [Link]

As fine grain as every subroutine call and return. Some routines do encounter issues with variance due to in-flight instructions and overhead due to shortness. Usually the issues are visible and specific, and can be handled. Sometimes the profile is the proof that inlining is appropriate. A subroutine whose execution is as short as a few dozen ticks can be measured meaningfully using an automated tool based on perfctr.

Dueling performance monitors

Posted Dec 10, 2008 8:18 UTC (Wed) by njs (subscriber, #40338) [Link]

>The bad news is that, despite the fact that processors have been able to collect that data for many years, there has never been support for this kind of performance monitoring in the mainline kernel.

So, uh... why doesn't oprofile qualify? And generally, does anyone feel like explaining the relationship between oprofile and the new API(s) proposed here?

[OT] Article Width

Posted Dec 10, 2008 9:38 UTC (Wed) by mlawren (guest, #10136) [Link] (5 responses)

Anyone else having issues with the fixed width of this article? Broader than my display size.

[OT] Article Width

Posted Dec 10, 2008 15:07 UTC (Wed) by felixfix (subscriber, #242) [Link]

No problems here. Sometimes when I see what (I think) you describe, it's because the article includes a URL or quoted text which is wider than my display, and the browser makes everything else fit that artificial width.

[OT] Article Width

Posted Dec 10, 2008 22:50 UTC (Wed) by roelofs (guest, #2599) [Link] (3 responses)

Anyone else having issues with the fixed width of this article? Broader than my display size.

Yes, Perens' long URL (upstream comment) horked it up.

Greg

[OT] Article Width

Posted Dec 11, 2008 23:06 UTC (Thu) by jengelh (guest, #33263) [Link] (2 responses)

firefox3 automatically splits these. But of course it would have been better to use LWN's HTML formatting capabilities, or a tinyurl.

[OT] Article Width

Posted Dec 12, 2008 21:57 UTC (Fri) by giraffedata (guest, #1954) [Link]

Well, in this case a tinyurl would have defeated the purpose of the posting.

My Opera 8 splits the URL too.

[OT] Article Width

Posted Dec 15, 2008 2:35 UTC (Mon) by roelofs (guest, #2599) [Link]

I suppose I should upgrade to FF3 one of these years...

A (reasonably) browser-independent approach would be to auto-insert <wbr> pseudo-tags either before or after slashes, commas, ampersands, etc., in "words" that exceed, say, 30 or 40 characters. Most browsers treat those as optional break locations, but they're not otherwise displayed and don't insert whitespace into cut-and-pasted copies.

Greg

Dueling performance monitors

Posted Dec 10, 2008 11:56 UTC (Wed) by tnoo (subscriber, #20427) [Link]

so they monitor each other's performance?

Dueling performance monitors

Posted Dec 10, 2008 13:33 UTC (Wed) by zooko (guest, #2589) [Link] (4 responses)

How is this "performance monitor" stuff different from oprofile?

Oprofile

Posted Dec 10, 2008 14:13 UTC (Wed) by corbet (editor, #1) [Link] (3 responses)

Oprofile is a profiler - it tells you where your program is running. Performance monitors tell you more about why it's running in a particular area. With a performance monitor, you can, for example, determine whether a reorganization of a data structure reduces cache misses or not. It's a different level of information.

Oprofile

Posted Dec 10, 2008 14:49 UTC (Wed) by zooko (guest, #2589) [Link] (2 responses)

Could you tell me more? I still don't understand what a performance monitor tells you that oprofile doesn't. I've been thinking of implementing a high-performance data structure, and I always figured that to measure such things as cache misses I would use oprofile and tell oprofile to report which instructions were executing just before cache misses.

Maybe I just need to read the documentation of these here "performance monitor" thingies.

Oprofile

Posted Dec 10, 2008 16:16 UTC (Wed) by graydon (guest, #5009) [Link] (1 responses)

The original article is incorrect when it says "there has never been support for this kind of performance monitoring in the mainline kernel". Oprofile provides access to these performance counters already, and has since mid-2.5 development. I use it every few days on stock 2.6 distro kernels. It doesn't provide, say, the BTS and PEBS buffers; but you usually don't have to go quite that far down. If you're looking for hotspots in terms of CPI, bus traffic, unusual FPU conditions, cache miss or branch mispredict counters, you're just fine with oprofile.

Perfmon is a separate "drivers and API only" layer that you can run various profilers and tools on top of. It also gets you a little further into the really hairy monitoring hardware (PEBS/BTS), beyond the event counters. Essentially it's the layer of very machine-dependent guts that (proprietary) vtune and (free) oprofile both duplicate parts of, along with a rich programmatic interface. You can run oprofile on top of perfmon if you like. Or the two can simply co-ordinate their access to the same performance counters.

For normal developers, this is mostly all plumbing. If you want to work with hardware performance counters, you've been able to via oprofile on normal linux machines for the past 5 years or so (IIRC it landed early 2003). Current oprofiles have all sorts of additional higher-level machinery (call graph profiling, a JIT API, the ability to work with xen domains, etc.)

Oprofile

Posted Dec 10, 2008 17:14 UTC (Wed) by fuhchee (guest, #40059) [Link]

If anything, the ingo/gleixner scheme seems to be solely a sampling-oriented tool, and thus in reality more of a competitor to oprofile than to perfmon.

V3 patch

Posted Dec 11, 2008 17:11 UTC (Thu) by corbet (editor, #1) [Link] (1 responses)

For those still paying attention, the V3 patch is worth a look. Among other things, it adds a "counter group" concept which is clearly meant to address the concerns of developers who want to control and query multiple counters in an atomic manner.

V3 patch

Posted Dec 19, 2008 1:36 UTC (Fri) by huaz (guest, #10168) [Link]

I am sure Ingo is very good at tuning his code so it could go in. Poor perfmon3 guy indeed.

Dueling performance monitors

Posted Dec 12, 2008 22:04 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

This article would have been easier to follow with some detail as to what information these facilities extract from the hardware. Something more detailed than "support for low-level optimization of critical code." I get particularly lost trying to understand counters overflowing and reading batches of something, which is given as a feature of the Gleixner/Molnar scheme.

Anybody?

Dueling performance monitors

Posted Dec 13, 2008 14:19 UTC (Sat) by saffroy (guest, #43999) [Link]

Sometimes you want to go beyond algorithmic optimization in your program, and want to know if and how a particular piece of code could run any faster. There can be many reasons why the current code is not optimal yet: it could be causing frequent cache misses, or TLB misses, or branch prediction would not work well enough, etc. But without the hardware telling you exactly what is happening, all you can do is guess.

It's an easier game when the hardware helps you: that's why modern processors can be programmed to keep counters of events relating to performance issues, such as cache misses, or TLB misses, or branch prediction issues... Processors can also be programmed to generate an interrupt when a counter reaches a certain threshold (ie. when it "overflows"): at this point, the operating system can record which exact piece of code was running when this event occurred. Over time, you can thus accumulate statistics telling you how often your particular piece of code encounters one of the aforementioned performance problems.

Given these statistics, you can make a more educated guess as to how your code could be improved (eg. re-arrange some structure to reduce cache misses, etc).

A classic paper from Digital (1997) explains how they implemented it on their Alpha platforms:
http://www-plan.cs.colorado.edu/diwan/7135/p357-anderson.pdf

The "batches" mentioned in the article relates to the number of performance registers (counters) that can be read in one shot.

HTH


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds