Dueling performance monitors
The "perfmon" patch set has been under development for some years, but, for a number of reasons, it has never found its way into the mainline kernel. The most recent version of the patch was posted for review by Stéphane Eranian in late November. The perfmon patches show the signs of all those years of development work and usage experience; they offer a wide set of features and extensive user-space support. The full perfmon patch adds twelve system calls to the kernel; the posted version, though, trims that count back to five in the hope that a narrower interface will have a better chance of getting into the mainline. The additional system calls, one assumes, will be proposed for inclusion sometime after the perfmon core is merged. The reduced interface is described in the patch set; briefly, an application hooks into the performance monitoring subsystem with a call to:
int pfm_create(int flags, pfarg_sinfo_t *regs);
This system call returns a file descriptor to identify the performance monitoring session. The regs parameter is used to return a list of performance monitoring registers available on the current system; flags is currently unused.
Specific performance counter registers can be manipulated with:
int pfm_write(int fd, int flags, int type, void *d, size_t sz);
int pfm_read(int fd, int flags, int type, void *d, size_t sz);
These system calls can be used to write values into registers (thus programming the performance monitoring hardware) and to read counter and configuration information from those registers.
Actually doing some performance monitoring requires a couple more calls:
int pfm_attach(int fd, int flags, int target);
int pfm_set_state(int fd, int flags, int state);
A call to pfm_attach() specifies which process is to be monitored; pfm_set_state() then turns monitoring on and off.
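The order of calls described above can be sketched in C. These system calls are only a proposal, so the sketch supplies in-memory stand-ins for all five of them. Only the prototypes and the call order come from the text; the pfarg_sinfo_t layout, the DEMO_* constants, the fake file descriptor, and the pretend count of 1000 events are assumptions made purely so the sequence compiles and runs.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder type and constants: the article does not define them. */
typedef struct { unsigned long avail_regs; } pfarg_sinfo_t;
enum { DEMO_TYPE_CONFIG = 1, DEMO_TYPE_COUNTER = 2 }; /* hypothetical "type" values */
enum { DEMO_STATE_STOP = 0, DEMO_STATE_START = 1 };   /* hypothetical "state" values */

/* In-memory stand-in for one monitoring session (not kernel state). */
static struct {
    int attached_to;          /* pid being monitored, -1 if none */
    int running;              /* nonzero while monitoring is on */
    unsigned long config;     /* value last written to the config register */
    unsigned long counter;    /* pretend event count */
} session = { -1, 0, 0, 0 };

int pfm_create(int flags, pfarg_sinfo_t *regs)
{
    (void)flags;              /* currently unused, per the text */
    regs->avail_regs = 0x3;   /* pretend two registers are available */
    return 3;                 /* fake file descriptor */
}

int pfm_write(int fd, int flags, int type, void *d, size_t sz)
{
    (void)fd; (void)flags;
    if (type != DEMO_TYPE_CONFIG || sz != sizeof(session.config))
        return -1;
    memcpy(&session.config, d, sz);
    return 0;
}

int pfm_read(int fd, int flags, int type, void *d, size_t sz)
{
    (void)fd; (void)flags;
    if (type != DEMO_TYPE_COUNTER || sz != sizeof(session.counter))
        return -1;
    memcpy(d, &session.counter, sz);
    return 0;
}

int pfm_attach(int fd, int flags, int target)
{
    (void)fd; (void)flags;
    session.attached_to = target;
    return 0;
}

int pfm_set_state(int fd, int flags, int state)
{
    (void)fd; (void)flags;
    if (session.attached_to < 0)
        return -1;                      /* nothing attached yet */
    if (session.running && state == DEMO_STATE_STOP)
        session.counter += 1000;        /* pretend 1000 events were counted */
    session.running = (state == DEMO_STATE_START);
    return 0;
}

/* The call order from the text: create, program, attach, start, stop, read. */
int monitor_once(int target_pid, unsigned long event_code, unsigned long *count)
{
    pfarg_sinfo_t regs;
    int fd = pfm_create(0, &regs);

    if (fd < 0)
        return -1;
    if (pfm_write(fd, 0, DEMO_TYPE_CONFIG, &event_code, sizeof(event_code)))
        return -1;
    if (pfm_attach(fd, 0, target_pid))
        return -1;
    if (pfm_set_state(fd, 0, DEMO_STATE_START))
        return -1;
    /* ... the monitored process would run here ... */
    if (pfm_set_state(fd, 0, DEMO_STATE_STOP))
        return -1;
    return pfm_read(fd, 0, DEMO_TYPE_COUNTER, count, sizeof(*count));
}
```

A caller would invoke monitor_once() with a target pid and an event code. In a real deployment the event code is exactly the processor-specific knowledge that perfmon expects a user-space library to supply.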
There are a couple of distinctive aspects to the perfmon interface. One is that it knows almost nothing about the specific performance monitoring registers; that information, instead, is expected to live in user space. As a result, the bare perfmon system call interface is probably not something that most monitoring applications would use; instead, those system calls are hidden behind a user-space library which knows how to program different types of processors for the desired results. Beyond that, perfmon uses the ptrace() mechanism to stop the monitored process while performance counters are being queried; as a result, the monitoring process must have the right to trace the target process.
On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new performance counter subsystem. The announcement states:
This is not the first time that these developers have shown up with an out-of-the-blue reimplementation of somebody else's subsystem; other examples include the CFS scheduler, high-resolution timers, dynamic tick, and realtime preemption. Most of the time, the new code quickly supplants the older version - an occurrence which is not always pleasing to the original developers - but the situation does not seem quite as straightforward this time.
The proposed interface is much simpler, adding a single system call:
int perf_counter_open(u32 hw_event_type, u32 hw_event_period, u32 record_type, pid_t pid, int cpu);
This call will return a file descriptor corresponding to a single hardware counter. A call to read() will then return the current value of the counter. The hw_event_period can be used to block reads until the counter overflows the given value, allowing, for example, events to be queried in batches of 1000. The pid parameter can be used to target a specific process, and cpu can restrict monitoring to a specific processor.
There are a few advantages claimed for the new implementation. The simplicity of the system call interface is one of those; it is possible to write a very simple application to perform monitoring tasks, with no additional libraries required. The second version of the patch includes a simple "kerneltop" utility which can display a constantly-updated profile of anything the performance counting hardware can monitor. Another advantage is the avoidance of ptrace(); this reduces the amount of privilege needed by the monitoring process and avoids perturbing the monitored process by stopping and restarting it. The management of counters is said to be more flexible, with facilities for sharing counters between processes and reserving them for administrative access. The low-level hardware interface is said to be simpler as well.
Those claimed advantages notwithstanding, a number of complaints have been raised with regard to the new performance monitoring code. Two of those seem to be at the top of the list: the single counter per file descriptor API, and programming the hardware performance monitoring unit inside the kernel. On the API side, the biggest concern is that putting each counter behind its own file descriptor makes it very hard to correlate two or more counters. Reading two counters requires two independent read() system calls; as is always the case, just about anything could happen between those two calls. So it's hard to tell how two different counter values relate to each other. But that sort of correlation is exactly what developers doing performance optimization want to do. Paul Mackerras says:
In response, Ingo argues that the loss of precision caused by independent read() calls is small - much smaller than the muddying of the results caused by stopping the target process so that all of the counters can be read at the same time. That argument does not appear to have convinced the detractors, though.
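The size of the disagreement can be made concrete with a toy calculation. The sketch below is not a measurement: it assumes a cycle counter that gains 1 per tick and an instruction counter that gains 2 per tick (a true ratio of exactly 2.0), and it models the second read() as returning some number of ticks after the first. The rates and the function name are invented for illustration.

```c
#include <stdint.h>

/*
 * window: ticks elapsed when the cycle counter is read (must be nonzero)
 * gap:    further ticks that pass before the instruction counter is read
 *
 * Returns the apparent instructions-per-cycle ratio multiplied by 1000,
 * truncated to an integer. The true value is always 2000.
 */
uint64_t apparent_ipc_x1000(uint64_t window, uint64_t gap)
{
    uint64_t cycles = window;                    /* first read()  */
    uint64_t instructions = 2 * (window + gap);  /* second read() */

    return instructions * 1000 / cycles;
}
```

With a 1000-tick window and a 50-tick gap between the reads, the function returns 2100, an overstatement of 5%. With a 1,000,000-tick window and the same gap it returns 2000, since the error has shrunk below the rounding. In this model the skew matters for short measurement windows and fades for long ones, which is one way to see how both sides can hold their positions.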
The other complaint is that moving the counter programming task into the kernel requires that the kernel know about the complexities of every possible performance monitoring unit it may encounter. This hardware sits at the core of the most performance-critical CPU subsystems, so its design parameters value non-interference above features or a straightforward programming interface. So programming it can be a complex business, involving sizeable tables describing how various operations interact with each other. The perfmon code keeps those tables in a user-space library, but the alternative implementation won't allow that. Quoting Paul again:
Your API condemns us to adding all that bloat to the kernel, plus the code to use those tables.
Paul (and others) argue that this information - which can add up to hundreds of kilobytes - is better kept in user space.
There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was posted for review. It must, indeed, be a shock to work on a subsystem for years, then find a proposed replacement sitting in one's mailbox. As David Miller put it:
So, at this point, what will happen with performance monitoring is unclear at best. Perhaps, though, this discussion will have the effect of raising the profile of performance monitoring, which has been without proper kernel support for many years. The merging of either solution - or, perhaps, a combination of both - seems like it has to be an improvement over having no support at all.
Posted Dec 9, 2008 18:38 UTC (Tue)
by SiliconSlick (guest, #39955)
[Link] (2 responses)
It's still a very useful performance monitoring API and utilized quite extensively by UT Knoxville's Performance API (PAPI), even if it never makes the big time.
Posted Dec 9, 2008 18:48 UTC (Tue)
by BrucePerens (guest, #2510)
[Link] (1 responses)
:-)
Posted Dec 9, 2008 19:10 UTC (Tue)
by SiliconSlick (guest, #39955)
[Link]
Posted Dec 9, 2008 19:33 UTC (Tue)
by foom (subscriber, #14868)
[Link] (4 responses)
FWIW, last time I tried, perfmon2 seemed to be way too slow to be usable for fine-grained testing,
Posted Dec 9, 2008 22:32 UTC (Tue)
by deater (subscriber, #11746)
[Link] (3 responses)
I find it to be easier to use and generate much better results than the alternatives.
Though I agree, as long as *some* performance monitor implementation gets merged I'll be happy. I'd prefer that it were perfmon, as I've spent a lot of work getting MIPS r10k, SPARC, Pentium Pro, Pentium II, and Athlon working and it would be a shame to have to go through and get yet another performance counter infrastructure up to speed with all of the platforms I need.
Posted Dec 10, 2008 14:52 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (2 responses)
Posted Dec 10, 2008 15:59 UTC (Wed)
by deater (subscriber, #11746)
[Link] (1 responses)
Posted Dec 10, 2008 21:02 UTC (Wed)
by jreiser (subscriber, #11027)
[Link]
Posted Dec 10, 2008 8:18 UTC (Wed)
by njs (subscriber, #40338)
[Link]
So, uh... why doesn't oprofile qualify? And generally, does anyone feel like explaining the relationship between oprofile and the new API(s) proposed here?
Posted Dec 10, 2008 9:38 UTC (Wed)
by mlawren (guest, #10136)
[Link] (5 responses)
Posted Dec 10, 2008 15:07 UTC (Wed)
by felixfix (subscriber, #242)
[Link]
Posted Dec 10, 2008 22:50 UTC (Wed)
by roelofs (guest, #2599)
[Link] (3 responses)
Yes, Perens' long URL (upstream comment) horked it up.
Greg
Posted Dec 11, 2008 23:06 UTC (Thu)
by jengelh (guest, #33263)
[Link] (2 responses)
Posted Dec 12, 2008 21:57 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
Well, in this case a tinyurl would have defeated the purpose of the posting.
My Opera 8 splits the URL too.
Posted Dec 15, 2008 2:35 UTC (Mon)
by roelofs (guest, #2599)
[Link]
A (reasonably) browser-independent approach would be to auto-insert <wbr> pseudo-tags either before or after slashes, commas, ampersands, etc., in "words" that exceed, say, 30 or 40 characters. Most browsers treat those as optional break locations, but they're not otherwise displayed and don't insert whitespace into cut-and-pasted copies.
Greg
Posted Dec 10, 2008 11:56 UTC (Wed)
by tnoo (subscriber, #20427)
[Link]
Posted Dec 10, 2008 13:33 UTC (Wed)
by zooko (guest, #2589)
[Link] (4 responses)
Posted Dec 10, 2008 14:13 UTC (Wed)
by corbet (editor, #1)
[Link] (3 responses)
Posted Dec 10, 2008 14:49 UTC (Wed)
by zooko (guest, #2589)
[Link] (2 responses)
Maybe I just need to read the documentation of these here "performance monitor" thingies.
Posted Dec 10, 2008 16:16 UTC (Wed)
by graydon (guest, #5009)
[Link] (1 responses)
The original article is incorrect when it says "there has never been support for this kind of performance monitoring in the mainline kernel". Oprofile provides access to these performance counters already, and has since mid-2.5 development. I use it every few days on stock 2.6 distro kernels. It doesn't provide, say, the BTS and PEBS buffers; but you usually don't have to go quite that far down. If you're looking for hotspots in terms of CPI, bus traffic, unusual FPU conditions, cache miss or branch mispredict counters, you're just fine with oprofile.
Perfmon is a separated "drivers and API only" layer that you can run various profilers and tools on top of. It also gets you a little further into the really hairy monitoring hardware (PEBS/BTS), beyond the event counters. Essentially it's the layer of very machine-dependent guts that (proprietary) vtune and (free) oprofile both duplicate parts of, along with a rich programmatic interface. You can run oprofile on top of perfmon if you like. Or the two can simply co-ordinate their access to the same performance counters.
For normal developers, this is mostly all plumbing. If you want to work with hardware performance counters, you've been able to via oprofile on normal linux machines for the past 5 years or so (IIRC it landed early 2003). Current oprofiles have all sorts of additional higher-level machinery (call graph profiling, a JIT API, the ability to work with xen domains, etc.)
Posted Dec 10, 2008 17:14 UTC (Wed)
by fuhchee (guest, #40059)
[Link]
Posted Dec 11, 2008 17:11 UTC (Thu)
by corbet (editor, #1)
[Link] (1 responses)
Posted Dec 19, 2008 1:36 UTC (Fri)
by huaz (guest, #10168)
[Link]
Posted Dec 12, 2008 22:04 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (1 responses)
This article would have been easier to follow with some detail as to what information these facilities extract from the hardware. Something more detailed than "support for low-level optimization of critical code."
I get particularly lost trying to understand counters overflowing and reading batches of something, which is given as a feature of the Gleixner/Molnar scheme.
Anybody?
Posted Dec 13, 2008 14:19 UTC (Sat)
by saffroy (guest, #43999)
[Link]
It's an easier game when the hardware helps you: that's why modern processors can be programmed to keep counters of events relating to performance issues, such as cache misses, or TLB misses, or branch prediction issues... Processors can also be programmed to generate an interrupt when a counter reaches a certain threshold (ie. when it "overflows"): at this point, the operating system can record which exact piece of code was running when this event occurred. Over time, you can thus accumulate statistics telling you how often your particular piece of code encounters one of the aforementioned performance problems.
Given these statistics, you can make a more educated guess as to how your code could be improved (eg. re-arrange some structure to reduce cache misses, etc).
A classic paper from Digital (1997) explains how they implemented it on their Alpha platforms: http://www-plan.cs.colorado.edu/diwan/7135/p357-anderson.pdf
The "batches" mentioned in the article relate to the number of performance registers (counters) that can be read in one shot.
HTH
perfctr?
http://user.it.uu.se/~mikpe/linux/perfctr/
You mean the code in http://user.it.uu.se/~mikpe/linux/perfctr/2.7/UNSUPPORTED/BUG_REPORTS_FOR_ANYTHING_BUT_PPC64_WILL_BE_IGNORED/?
to patch something into your kernel, just to do performance testing.
so I'm not too unhappy if it doesn't win. Perfctr w/PAPI seems the way to go at the moment.
[OT] Article Width
Anyone else having issues with the fixed width of this article? Broader than my display size.
I suppose I should upgrade to FF3 one of these years...
Oprofile is a profiler - it tells you where your program is running. Performance monitors tell you more about why it's running in a particular area. With a performance monitor, you can, for example, determine whether a reorganization of a data structure reduces cache misses or not. It's a different level of information.
Oprofile
For those still paying attention, the V3 patch is worth a look. Among other things, it adds a "counter group" concept which is clearly meant to address the concerns of developers who want to control and query multiple counters in an atomic manner.
V3 patch
http://www-plan.cs.colorado.edu/diwan/7135/p357-anderson.pdf