
Using the perf code to create a RAS daemon

By Jake Edge
February 2, 2011

Monitoring a system for "reliability, availability, and serviceability" (RAS) is an important part of keeping that system, or a cluster of such computers, up and running. There is a wide variety of things that could be monitored for RAS purposes—memory errors, CPU temperature, RAID and filesystem health, and so on—but Borislav Petkov's RAS daemon is targeted solely at gathering information on any machine check exceptions (MCEs) that occur. The daemon uses trace events and the perf infrastructure, which requires a fair amount of restructuring of the perf code to make it available not only to the RAS daemon, but also to other kinds of tools.

The first step is to create persistent perf events, which are events that are always enabled, and will have event buffers allocated, even if there is no process currently looking at the data. That allows the MCE trace event to be enabled at boot time, before there is any task monitoring the perf buffer. Once the boot has completed, the RAS daemon (or some other program) can mmap() the event buffers and start monitoring the event. This will allow the RAS daemon to pick up any MCE events that happened during the boot process.
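To give a concrete idea of what that looks like from user space, here is a minimal sketch of how a monitoring program might attach to such a buffer, assuming the persistent event's buffer is exposed as a mappable per-CPU file (as the debugfs files described later are); the path and buffer-length handling are illustrative only:

    /* Hedged sketch: attach to a persistent event buffer exposed as a
     * mappable file.  The path and buffer length are placeholders. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *map_persistent_buffer(const char *path, size_t len)
    {
        int fd = open(path, O_RDONLY);
        void *buf;

        if (fd < 0)
            return NULL;

        /* shared and read-only, so several monitors can attach at once */
        buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);              /* the mapping keeps the buffer around */
        return buf == MAP_FAILED ? NULL : buf;
    }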

To do that, struct perf_event_attr gets a new persistent bitfield that determines whether the event buffers are destroyed when they are unmapped. In addition, persistent events can be shared by multiple monitoring programs because they can be mapped as shared and read-only. With persistent events in place, the next patch changes the MCE trace event into a persistent event.
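For illustration, the following sketch shows roughly how such an event might be requested from user space; the persistent bit is not part of the mainline struct perf_event_attr layout, so it appears only as a comment, and the tracepoint-ID handling is simplified:

    /* Sketch only: the "persistent" attribute bit comes from the proposed
     * patches and is not part of the mainline perf_event_attr layout. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_mce_event(__u64 tracepoint_id, int cpu)
    {
        struct perf_event_attr attr = {
            .type        = PERF_TYPE_TRACEPOINT,
            .size        = sizeof(attr),
            .config      = tracepoint_id,   /* id of the MCE trace event */
            .sample_type = PERF_SAMPLE_RAW,
        };

        /* attr.persistent = 1;  -- the proposed bit: keep the buffer
         *                          allocated even with no consumer */

        /* system-wide event on one CPU: pid == -1 */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
    }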

With the stage thus set, Petkov then starts to rearrange the perf code so that the RAS daemon and other tools can access code that is currently buried in the tools/perf directory. That includes the trace event utilities, which move from tools/perf/util to tools/lib/trace, and some helper functions for debugfs, which move to tools/lib/lk. These are facilities that were needed when creating the RAS daemon, but were not easily accessible outside of perf itself.

A similar patch moves the mmap() helper functions from the tools/perf directory to another new library: tools/lib/perf. These functions handle things like reading the head of the event buffer queue, writing at the tail of the queue, and reading and summing all of the per-cpu event counters for a given event.
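The sketch below shows the flavor of those helpers, using the data_head and data_tail fields of the kernel's struct perf_event_mmap_page; it is not the relocated code itself, and the barrier choice is an assumption:

    /* Sketch of the kind of ring-buffer helpers being moved into
     * tools/lib/perf; not the actual relocated code. */
    #include <linux/perf_event.h>

    static __u64 mmap_read_head(struct perf_event_mmap_page *pg)
    {
        __u64 head = pg->data_head;

        /* make sure sample data written before data_head is visible */
        __sync_synchronize();
        return head;
    }

    static void mmap_write_tail(struct perf_event_mmap_page *pg, __u64 tail)
    {
        /* finish reading samples before telling the kernel we are done */
        __sync_synchronize();
        pg->data_tail = tail;
    }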

In response to the patch moving the mmap() helpers, Arnaldo Carvalho de Melo pointed out that he had already reworked some of that code, and that Petkov's patch set would shrink once that work gets merged into the -tip tree. He also noted that he had created a set of Python bindings and a simple perf-event-consuming twatch daemon using those bindings. While Petkov had some reasons for writing the RAS daemon in C rather than Python, mostly so that it would work on systems without Python or with outdated versions, he did seem impressed: "twatch looks almost like a child's play and even my grandma can profile her system now :)."

But the Python bindings aren't necessarily meant for production code, as Carvalho de Melo describes. Because the Python bindings are quite similar to their C counterparts, they can be used to ensure that the kernel interfaces are right:

I.e. one can go on introducing the kernel interfaces and testing them using python, where you can, for instance, from the python interpreter command line, create counters, read its values, i.e. test the kernel stuff quickly and easily.

Moving to a C version then becomes easy after the testing phase is over and the kernel bits are set in stone.

There are some additional patches that move things around within the tools tree before the final patch actually adds the RAS daemon. The daemon is fairly straightforward, with the bulk of it being boilerplate daemon-izing code. The rest parses the MCE event format (from the mce/mce_record/format file in debugfs), then opens and maps the debugfs mce/mce_recordN files (where N is the CPU number). The main program sits in a loop, checking for MCE events every 30 seconds and, for any events that have occurred, writing the CPU number, MCE status, and address to a log file. Petkov mentions that decoding the MCE status is something he is currently working on.
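In outline, that main loop looks something like the sketch below; read_mce_record() and the record layout are hypothetical stand-ins for the daemon's code that decodes samples according to the debugfs format description:

    #include <stdio.h>
    #include <unistd.h>

    struct mce_rec {                    /* hypothetical decoded record */
        int                cpu;
        unsigned long long status;
        unsigned long long addr;
    };

    /* Placeholder: the real daemon parses a sample from the mmap()ed
     * mce/mce_recordN buffer using the debugfs format description. */
    static int read_mce_record(int cpu, struct mce_rec *rec)
    {
        (void)cpu; (void)rec;
        return 0;                       /* "no new record" in this stub */
    }

    static void ras_loop(FILE *logfile, int ncpus)
    {
        struct mce_rec rec;
        int cpu;

        for (;;) {
            for (cpu = 0; cpu < ncpus; cpu++)
                while (read_mce_record(cpu, &rec) > 0)
                    fprintf(logfile, "CPU %d: status 0x%llx, addr 0x%llx\n",
                            rec.cpu, rec.status, rec.addr);

            fflush(logfile);
            sleep(30);          /* check for new events every 30 seconds */
        }
    }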

Obviously, the RAS daemon itself is not the end result Petkov is aiming for. Rather, it is a proof of concept for persistent events that demonstrates one way to rearrange the perf code so that other tools can use it. There may be disagreements about how the libraries were arranged, or about the specific location of various helpers, but the overall goal seems like a good one. Whether tools like ras actually end up in the kernel tree is, perhaps, questionable—the kernel hackers may not want to maintain a bunch of tools of this kind—but making the utility code more accessible will make it much easier for others to build such tools on their own.



Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds