Relief for insomniac tracepoints

By Jonathan Corbet
October 29, 2020

The kernel's tracing infrastructure is designed to be fast and to interfere as little as possible with the normal operation of the system. One consequence of this requirement is that the code that runs when a tracepoint is hit cannot sleep; otherwise execution of the tracepoint could add an arbitrary delay to the execution of the real work the kernel should be doing. There are times, though, that the ability to sleep within a tracepoint would be handy, delays notwithstanding. The sleepable tracepoints patch set from Michael Jeanson sets the stage to make it possible for (some) tracepoint handlers to take a nap while performing their tasks — but stops short of completing the job for now.

Within the kernel, the tracing machinery has no need to sleep; its task is normally to package up the data associated with a given tracepoint and place the result into a ring buffer for transport to user space. This work can be accomplished without the need to wait for any outside events. The use cases driving the push for sleepable tracepoints thus must come from elsewhere — from BPF programs attached to tracepoints by user space, in particular. These programs are currently limited to accessing data in kernel space, which can always be done without the need to sleep. There would be value, though, in the ability to look at user-space data in a tracepoint handler as well. This data is not guaranteed to be resident in RAM when the handler tries to access it; should it not be present, a page fault will result. Handling page faults can take an arbitrary amount of time, during which the faulting process must be put to sleep.

In current kernels, this possibility prevents access to user-space data from tracepoint handlers. Specifically, it means that tracers cannot dereference pointers passed from user space. Thus, for example, a tracepoint running on entry to the openat2() system call can see the pointer to the open_how structure passed by user space, but is unable to examine the contents of the structure itself.

There is nothing about tracepoints that inherently makes sleeping impossible — at least, for those tracepoints that are executed when the kernel is not running in atomic context. But the BPF subsystem has long had its own rule that BPF programs could not sleep. That will change in the 5.10 kernel, though, thanks to the addition of sleepable BPF programs, which no longer have this constraint. Only certain types of BPF programs are allowed to block; in 5.10, tracing programs are on that list. There will be no users of this ability in the 5.10 release, though.

Jeanson's patch set lays the groundwork for the addition of such a user, establishing the infrastructure to support the attachment of sleepable BPF programs to specific tracepoints. This ability must be supported with care since, as noted above, the kernel is often running in a context where sleeping is a bad idea. Specifically, a sleepable BPF program can only be attached to a tracepoint located in a region of code where sleeping is allowed in general.

There is no way to know automatically whether a given tracepoint can safely sleep or not, so existing tracepoints will not allow the attachment of sleepable BPF programs without explicit modification to that effect. Tracepoints are added to kernel code with the TRACE_EVENT() macro, along with a few variants; the brave of heart can see the horrifying macro-magic details in include/linux/tracepoint.h. Jeanson's patch set adds a new macro called TRACE_EVENT_FN_MAYSLEEP() as a variant of TRACE_EVENT_FN(), which defines a tracepoint that has associated registration and unregistration functions. Switching an existing tracepoint to the new macro indicates that it is safe to attach sleepable programs there.
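The opt-in rule can be illustrated with a small user-space model. This is only a sketch of the rule as described here, not code from the patch set: the `model_*` names, the `may_sleep` flag, and the choice of `-EINVAL` as the rejection code are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_tracepoint {
    const char *name;
    bool may_sleep;   /* set only when the tracepoint was declared sleepable */
};

struct model_prog {
    const char *name;
    bool sleepable;   /* program may block, for example to take a page fault */
};

/*
 * Attach rule from the text: a sleepable program is accepted only by a
 * tracepoint that was explicitly marked as safe for sleeping. Programs
 * that never sleep can attach anywhere.
 * Returns 0 on success, -EINVAL (an assumed error code) on rejection.
 */
int model_attach(const struct model_tracepoint *tp,
                 const struct model_prog *prog)
{
    assert(tp != NULL && prog != NULL);
    if (prog->sleepable && !tp->may_sleep)
        return -EINVAL;
    return 0;
}
```

In this model, every tracepoint starts with `may_sleep` unset, which mirrors the point that existing tracepoints reject sleepable programs until somebody audits the call site and converts it.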

The most significant change within those macros is that, if a tracepoint is marked as accepting sleepable programs, the tracers called when that tracepoint is hit will be run with preemption enabled. That is a necessary precondition to being able to handle page faults, but it also changes the expectations under which all of those tracers were written. The tracers themselves will need modification to run safely with preemption enabled — work that has not yet been posted. The patch set handles that situation, for now, by modifying the ftrace, perf, and BPF tracers to explicitly disable preemption internally, thus avoiding any unfortunate surprises.
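A rough user-space model may help show why disabling preemption inside the tracers keeps their assumptions intact. Everything here is hypothetical: the counter, the `model_*` functions, and the probe are stand-ins invented for illustration and do not correspond to actual kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a preemption counter; 0 means preemption is enabled. */
int model_preempt_count;

void model_preempt_disable(void)
{
    model_preempt_count++;
}

void model_preempt_enable(void)
{
    assert(model_preempt_count > 0);
    model_preempt_count--;
}

typedef void (*model_probe_fn)(void *data);

/*
 * Fire a tracepoint. A tracepoint that is not marked sleepable disables
 * preemption around the probe; a sleepable one leaves it enabled.
 */
void model_fire(bool may_sleep, model_probe_fn probe, void *data)
{
    assert(probe != NULL);
    if (!may_sleep)
        model_preempt_disable();
    probe(data);
    if (!may_sleep)
        model_preempt_enable();
}

/*
 * An unconverted tracer. It was written assuming preemption is off, so it
 * now disables preemption itself and records the nesting depth it saw.
 */
void model_legacy_probe(void *data)
{
    int *seen = data;

    model_preempt_disable();
    *seen = model_preempt_count;
    model_preempt_enable();
}
```

Whichever kind of tracepoint calls it, the probe in this model always runs with a nonzero count, so its body never observes preemption enabled. The cost is that such a tracer still cannot sleep, which is why converting the tracers is left as follow-up work.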

As noted above, the use case that is driving this work is following pointers passed to system calls from user space. So it is not surprising that the first user of this capability will be system-call tracing. Jeanson's patch set changes the system-call entry and exit tracepoints to use TRACE_EVENT_FN_MAYSLEEP(), thus setting the stage for the attachment of sleepable programs that could rummage around in user-space memory in response to system calls.

There is only one piece that is missing at this point: actually fixing up the tracers and using the new infrastructure to attach and run sleepable BPF programs. As the cover letter to the patch set notes:

> This series only implements the tracepoint infrastructure required to allow tracers to handle page faults. Modifying each tracer to handle those page faults would be a next step after we all agree on this piece of instrumentation infrastructure.

This may seem like a strange place to stop, just before making everything actually work, but changes at this point could have significant effects on the subsequent patches.

Based on the discussion so far, it doesn't appear that there is any need for big changes at this level of the code; most of the comments relate to details around the edges. If that situation holds, we should expect to see patches in the near future that finish the job and enable the attachment of sleepable tracepoint programs. That may well lead to another increase in the capability of the tracing infrastructure for Linux.


Access user space data without sleeping

Posted Oct 30, 2020 7:54 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

Is there a way for kernel code to try accessing user space data, and either succeed immediately if it's in RAM, or fail immediately if it would need to be paged in? That might be good enough for most tracepoints.

Access user space data without sleeping

Posted Oct 30, 2020 11:23 UTC (Fri) by wEddy (guest, #135401) [Link]

copy_from_user_nofault() maybe? bpftrace can use it through the bpf_probe_read_user() helper.

Access user space data without sleeping

Posted Oct 30, 2020 13:14 UTC (Fri) by compudj (subscriber, #43335) [Link] (2 responses)

One clarification: we do access user-space memory already from LTTng at system call enter/exit by using __copy_from_user_inatomic(). However, if the userspace pages are not paged in memory, the access fails and we either truncate (if our userspace strnlen fails when calculating a string length) or write zeroes into the trace rather than the user-space data.

We found out however that for things like security-related tooling which rely on grabbing the open(2) file name argument, this behavior where we cannot read the user-space data in specific conditions (which I suspect can be controlled by user-space by carefully making sure the page is _not_ paged in memory) is bad for security-related system behavior analysis through tracing.

Moreover, it's bad for our continuous integration, because we have test-cases where system call parameters are expected in the trace. This makes the tests flaky because they can then fail spuriously depending on what is present or not in the page table.

Of course, there are plenty of use-cases where it's good enough that the user-space data show up when available, and it's not a big deal if it happens to be missing in a few cases, but there are lots of tracing use-cases for live system monitoring which depend on having reliable data, and those require taking the page fault at system call entry/exit to fetch the user-space data.
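The fallback behavior discussed in this comment, where a copy fails instead of faulting and the trace gets zeroes, can be modeled in user space. This is a simplified sketch, not LTTng or kernel code: `struct model_page`, `model_copy_nofault()`, and the tiny page size are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64

/* Hypothetical stand-in for one page of user memory. */
struct model_page {
    bool resident;                 /* false: touching it would page-fault */
    char data[MODEL_PAGE_SIZE];
};

/*
 * Copy without faulting: succeeds only if the range is valid and the page
 * is resident. On failure the destination is zero-filled and -EFAULT is
 * returned, so the caller records zeroes rather than the user data.
 */
int model_copy_nofault(void *dst, const struct model_page *src,
                       size_t off, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (off > MODEL_PAGE_SIZE || len > MODEL_PAGE_SIZE - off ||
        !src->resident) {
        memset(dst, 0, len);
        return -EFAULT;
    }
    memcpy(dst, src->data + off, len);
    return 0;
}
```

A caller that needs reliable data has no recovery path in this model other than sleeping until the page is brought in, which is the capability the patch set is preparing for.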

Access user space data without sleeping

Posted Oct 30, 2020 15:27 UTC (Fri) by walters (subscriber, #7396) [Link] (1 responses)

Using BPF for security intersects at KRSI, right?
https://lwn.net/Articles/813261/

Also doing things like looking at file paths should be known to be fairly flawed in general even if it weren't racy just *loading* the path - bind mounts etc. can obscure what you're seeing. The SELinux model of e.g. having `etc_t` for /etc avoids all races and problems with comparing file paths.

Access user space data without sleeping

Posted Oct 30, 2020 16:10 UTC (Fri) by compudj (subscriber, #43335) [Link]

As far as my own comment is concerned, I'm discussing the use of a trace post-processing approach (or live trace streaming) through LTTng to analyze the behavior of a system either after the fact or in real time (shortly after it has happened). There it is possible to reconstruct a model of the entire filesystem mounts and path hierarchy anywhere within the trace from a trace post-processing analysis tool.

I did not have the eBPF vs KRSI use-cases in mind when writing that comment.

Access user space data without sleeping

Posted Oct 30, 2020 21:48 UTC (Fri) by danobi (subscriber, #102249) [Link] (1 responses)

> Thus, for example, a tracepoint running on entry to the openat2() system call can see the pointer to the open_how structure passed by user space, but is unable to examine the contents of the structure itself.

IIUC, this is incorrect. BPF programs can dereference userspace memory with bpf_probe_read{,_user}. It's just that if that access would fault memory, the helper returns an error and the read memory is 0s. Unless the system is under memory pressure, I've usually only seen bpf_probe_read* fail on immutable strings stored in rodata.

For example, if you run the following bpftrace one-liner:

# bpftrace -e 'tracepoint:syscalls:sys_enter_openat2 { printf("0x%x\n", args->how->flags) }' --btf -kk
...
0x40
0x410002
0x200000

against `tools/testing/selftests/openat2/openat2_test`, things seem to work right.

(the --btf flag resolves the tracepoint types, the -kk flag reports if bpf helpers return an error).

Access user space data without sleeping

Posted Oct 31, 2020 2:17 UTC (Sat) by simcop2387 (subscriber, #101710) [Link]

I don't know how it relates to tracepoints and other things, but this is mostly true. eBPF can do the dereferencing you mentioned, but cBPF can't. This is one of the things that differs for seccomp()'s cBPF programs.

Relief for insomniac tracepoints

Posted Nov 1, 2020 13:44 UTC (Sun) by ringerc (subscriber, #3071) [Link]

> There would be value, though, in the ability to look at user-space data in a tracepoint handler as well.

Gee, really?

Kernel people have been saying "Systemtap is obsolete! Use ebpf-tools! Use bpftrace!"

So I tried to convert a few of my simplest trace programs from systemtap to run as bpftrace scripts and found that it was *not possible to read the value of a short null terminated C string from userspace in bpftrace*. At least when I tried it.

perf isn't a lot better there either. Got static tracepoints with char* arguments? Good luck with that. It doesn't know how to capture the duration of a syscall without a lot of help or post-processing either. (strace -c can, but is expensive and limited).

I can't help but wish various interested parties would actually finish one trace framework before replacing it with another that - again - only services the needs of kernel hackers.

So yeah. I can't say I'm entirely shocked that it might be desirable to read userspace memory when doing full-system tracing, complex program flow analysis, targeted performance work, etc.