User events — but not quite yet

By Jonathan Corbet
April 18, 2022

The ftrace and perf subsystems provide visibility into the workings of the kernel; by activating existing tracepoints, interested developers can see what is happening at specific points in the code. As much as kernel developers may resist the notion, though, not all events of interest on a system happen within the kernel. Administrators will often want to look inside user-space processes as well; they would be even happier with a mechanism that allows the simultaneous tracing of events in both the kernel and user space. The user-events subsystem, developed by Beau Belgrave and added during the 5.18 merge window, promises that capability, but users will almost certainly have to wait another cycle to gain access to it.

Kernel tracepoints are hooks at specific locations in the code. They are designed to add as little overhead as possible when they are not active, which is the case most of the time. When a tracepoint is activated, it produces a stream of structured data specific to the event being monitored; user space can read that data via a number of different interfaces. By turning on just the tracepoints of interest, user space can collect the data needed to analyze a specific situation without slowing down the kernel overall.

The user-events ABI

Belgrave's user-space equivalent to kernel-space tracepoints, merged for 5.18, requires a bit more work to support, though libraries provided in the future may ease some of that burden. The first step is to open a new file added to the tracefs kernel filesystem:

    /sys/kernel/debug/tracing/user_events_data

A program then needs to register each event that it wishes to make available to the system. That is done by filling out this structure:

    struct user_reg {
        u32 size;
        u64 name_args;
        u32 status_index;
        u32 write_index;
    };

The first two members are input parameters, while the last two are set by the kernel. The size parameter should just be the size of the user_reg structure itself; this helps to ensure compatibility if the structure grows in future kernel releases. The event itself is described by name_args, which is a pointer to a string; it uses a special format added with this patch set. The first token is the name of the event; the rest of the line describes the data reported for that event. So an event that reports an unsigned 32-bit integer named level and a 20-character string named badness could be described as:

    my-event u32 level; char[20] badness

This structure is then registered with an ioctl() operation on the previously opened user_events_data file, using the DIAG_IOCSREG command. On successful registration, the kernel will store two index values in status_index and write_index, the use of which will be described below.
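The registration step might look something like the following minimal sketch. The structure is a local copy of the one shown above, with the kernel's u32 and u64 types replaced by their <stdint.h> equivalents. The helper names user_reg_init() and user_event_register() are invented for illustration, and the ioctl() request number is passed in as a parameter on the assumption that the caller obtains DIAG_IOCSREG from the kernel's user-events header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copy of the structure shown above, using fixed-width C types. */
struct user_reg {
    uint32_t size;          /* in:  sizeof(struct user_reg) */
    uint64_t name_args;     /* in:  pointer to the description string */
    uint32_t status_index;  /* out: index into the mapped status file */
    uint32_t write_index;   /* out: prefix for each written event */
};

/* Hypothetical helper: fill in the two input members, zero the rest. */
void user_reg_init(struct user_reg *reg, const char *desc)
{
    assert(reg != NULL && desc != NULL);
    memset(reg, 0, sizeof(*reg));
    reg->size = sizeof(*reg);
    reg->name_args = (uint64_t)(uintptr_t)desc;
}

/*
 * Hypothetical helper: register one event on an already-open
 * user_events_data descriptor. "request" is assumed to be the
 * DIAG_IOCSREG value from the kernel's header; it is not defined here.
 * Returns 0 on success, -1 (with errno set) on failure.
 */
int user_event_register(int data_fd, unsigned long request,
                        const char *desc, struct user_reg *reg)
{
    user_reg_init(reg, desc);
    return ioctl(data_fd, request, reg) < 0 ? -1 : 0;
}
```

A caller would open user_events_data for reading and writing, pass the resulting descriptor along with a description string like the one above, and read status_index and write_index out of the structure once the call succeeds.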

Once the event is registered, it will show up in tracefs under the user_events subsystem. That means it can be activated, and its data collected, using any of the usual user-space tools. But to get there, the application must still provide that data when the time comes.

To do that, the program should open the other new tracefs file as well:

    /sys/kernel/debug/tracing/user_events_status

That file should then be mapped into the program's address space with an mmap() call.

Like its kernel counterpart, the user-events mechanism has been designed to minimize its overhead when nobody is interested in the events. So the program implementing the events will only want to provide the data if it has been requested. The user_events_status file that was just mapped above will contain a byte of data indicating whether the event is active or not; its index will be the value stored in the status_index field during registration. If that byte is zero, the event is not active and the program should not output any data; that is expected to be the case most of the time.
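As a rough sketch of the mapping and the status check, assuming that a single read-only page is enough to cover the events of interest: the function names status_map() and event_is_active() are made up for this example, and the path is supplied by the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first page of the status file read-only.
 * Returns NULL on failure. Mapping exactly one page is an assumption
 * made for this sketch.
 */
const volatile unsigned char *status_map(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (page <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Non-zero if anything is attached to the event at status_index. */
int event_is_active(const volatile unsigned char *status,
                    uint32_t status_index)
{
    assert(status != NULL);
    return status[status_index] != 0;
}
```

The volatile qualifier is there because the kernel can change the byte at any time; without it, the compiler would be free to cache the value across repeated checks.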

When somebody attaches to the event, the associated byte will no longer read as zero. It is, in fact, a bitmap giving information about how the event has been attached; one bit corresponds to ftrace, while another is for perf. When the program sees that non-zero byte, it should write the data associated with the event to the user_events_data file opened at the beginning. The first four bytes of the written data should be the value the kernel stored in write_index at registration time; the rest will be the data as described. Typically, a writev() call will be needed to assemble the requisite bits.
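Putting the check and the write together for the my-event example might look like the sketch below. The struct my_event layout and the my_event_emit() helper are assumptions for illustration; the status pointer, status_index, and write_index are expected to come from the mapping and registration steps described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Payload matching "my-event u32 level; char[20] badness". */
struct my_event {
    uint32_t level;
    char badness[20];
};

/*
 * Emit one event if, and only if, its status byte is non-zero.
 * Returns the writev() result, or 0 when the event is inactive.
 */
ssize_t my_event_emit(int data_fd, const volatile unsigned char *status,
                      uint32_t status_index, uint32_t write_index,
                      uint32_t level, const char *badness)
{
    assert(status != NULL && badness != NULL);

    if (status[status_index] == 0)
        return 0;   /* nobody is listening; skip the system call */

    struct my_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.level = level;
    strncpy(ev.badness, badness, sizeof(ev.badness) - 1);

    /* First the four-byte write_index, then the event data itself. */
    struct iovec iov[2] = {
        { .iov_base = &write_index, .iov_len = sizeof(write_index) },
        { .iov_base = &ev,          .iov_len = sizeof(ev) },
    };
    return writev(data_fd, iov, 2);
}
```

Using writev() here avoids copying the index and the payload into one contiguous buffer before each write.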

That describes the bulk of the API. More information can be found in this documentation commit and this sample program. There is also, inevitably, a way to attach BPF programs to user events, but that feature is not described in detail in the documents.

Concerns

After this code was merged, Linux Trace Toolkit (LTTng) developer Mathieu Desnoyers posted some criticisms of the new interface. The byte-based status mechanism struck him as inefficient; providing a single bit for each event would allow for a more compact representation and, thus, better cache utilization. The multiple bits of information indicating how an event has been attached have no real value to the application being traced, which should produce the same data regardless.
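To make the size argument concrete, here is a small comparison of the two approaches. The bit-per-event variant is purely hypothetical; its bit ordering and the function names are assumptions made for this sketch and do not reflect any proposed ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merged scheme: one byte per event; any non-zero value means "attached". */
int byte_status_active(const volatile unsigned char *status, uint32_t index)
{
    assert(status != NULL);
    return status[index] != 0;
}

/* Hypothetical alternative: one bit per event, lowest bit first. */
int bit_status_active(const volatile unsigned char *status, uint32_t index)
{
    assert(status != NULL);
    return (status[index / 8] >> (index % 8)) & 1;
}

/* Bytes of status memory needed to cover n events under each scheme. */
size_t byte_status_size(size_t n) { return n; }
size_t bit_status_size(size_t n)  { return (n + 7) / 8; }
```

Under these assumptions, 512 events need 512 bytes of status data in the first scheme and 64 bytes in the second.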

He had some other concerns as well. If the page(s) containing the data to be written for an event are forced out of memory, the resulting page fault will cause writev() to fail and, absent active countermeasures, the event data will be lost. The mechanism as a whole is built around access to tracing data via the kernel; it will only add overhead when purely user-space tracers (such as LTTng) are in use. There were a number of implementation concerns as well.

Desnoyers also brought the facility to the attention of BPF maintainer Alexei Starovoitov, who had been unaware of it. He was not happy with what he saw; he called for the BPF mechanism to be removed immediately: "It's a hard Nack to add a bpf interface to user_events". He has reiterated that position in subsequent discussions.

Belgrave quickly posted a patch removing the BPF feature, as requested. But it looks like that will not be enough for this feature to be enabled in 5.18. Tracing maintainer Steven Rostedt stated his agreement with Desnoyers, saying that he is considering marking the whole mechanism as "broken" so that the issues can be resolved. It is conceivable that Belgrave could address all of the concerns in this development cycle, but it is unlikely; that sort of work is not meant to go into the mainline after the merge window closes. So, chances are, users will have to wait until 5.19 for access to the new user-events tracing mechanism.


Posted Apr 18, 2022 22:33 UTC (Mon) by pbonzini (subscriber, #60935)

I'm confused, why was this included at all in a pull request? Also Starovoitov's message (https://lwn.net/ml/linux-kernel/CAADnVQK=GCuhTHz=iwv0r7Y3...) completely lacks a description of _why_ the eBPF interface is bad.

All maintainers have a bad day every now and then, but this really looks like a bad example of kernel development.

Posted Apr 19, 2022 4:08 UTC (Tue) by alison (subscriber, #63752)

Why is the new mechanism better than uprobes, which have easy-to-use BPF support via the USDT facility in the folly library?

https://github.com/facebook/folly/tree/main/folly/tracing

Compiling against all of folly to get the userspace tracing would be a pain, but the code is Apache-licensed, so perhaps that is not necessary.

Posted Apr 20, 2022 6:27 UTC (Wed) by lathiat (subscriber, #18567)

Note that USDT probes can also be defined using headers from systemtap (sys/sdt.h - systemtap-sdt-dev{el,}), but using them doesn't actually require systemtap; that's just where the headers live.

This library generally is GPLv2-only licensed but the relevant sys/sdt.h file is "dedicated to the public domain, pursuant to CC0 (https://creativecommons.org/publicdomain/zero/1.0/)"

It uses the same definitions and is source-compatible with the definitions from DTRACE (so the macros are named DTRACE_*):
https://sourceware.org/systemtap/wiki/AddingUserSpaceProb...
https://sourceware.org/systemtap/wiki/UserSpaceProbeImple...

Posted Apr 20, 2022 8:05 UTC (Wed) by net_benji (subscriber, #75195)

Alexei gave reasons for his disapproval in another thread:
https://lore.kernel.org/linux-trace-devel/CAADnVQJFjXDvqM...

> The whole user_events feature looks redundant to me.
> We have uprobes and usdt. It doesn't look to me that
> user_events provide anything new that wasn't available earlier.

Posted Apr 19, 2022 18:01 UTC (Tue) by geert (subscriber, #98403)

The first thing that struck me was the implicit gap between the size and name_args fields. The gap may or may not be present, depending on the alignment rules of an architecture.
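One quick way to see the gap on a given architecture is to compare the naturally aligned layout against a packed one, as in this sketch. The structure and function names are invented for the comparison, and the packed attribute is the GCC/Clang spelling, shown here only as one possible way to pin the layout down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members as user_reg, laid out by the compiler's default rules. */
struct user_reg_natural {
    uint32_t size;
    uint64_t name_args;
    uint32_t status_index;
    uint32_t write_index;
};

/* Same members with padding disabled (GCC/Clang attribute). */
struct user_reg_packed {
    uint32_t size;
    uint64_t name_args;
    uint32_t status_index;
    uint32_t write_index;
} __attribute__((packed));

/* Padding bytes between size and name_args in each layout. */
size_t natural_gap(void)
{
    return offsetof(struct user_reg_natural, name_args) - sizeof(uint32_t);
}

size_t packed_gap(void)
{
    return offsetof(struct user_reg_packed, name_args) - sizeof(uint32_t);
}
```

Where 64-bit integers are aligned to eight bytes, natural_gap() returns four and the structure occupies 24 bytes; where they are aligned to four bytes, it returns zero and the structure occupies 20.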

Posted Apr 27, 2022 18:01 UTC (Wed) by anton (subscriber, #25547)

It seems to me that the proposed mechanism is expensive when an event is actually monitored: one writev() system call per event. This will limit the uptake even if it gets accepted.