
Leading items

Welcome to the LWN.net Weekly Edition for February 27, 2020

This edition contains the following feature content:

  • Impedance matching for BPF and LSM: the KRSI patch set and its impact on the LSM infrastructure.
  • Memory-management optimization with DAMON: making memory-access patterns visible to user space.
  • CAP_PERFMON — and new capabilities in general: splitting performance monitoring out of CAP_SYS_ADMIN.
  • watch_mount(), watch_sb(), and fsinfo() (again): a trio of proposed filesystem-information system calls.
  • A look at "BPF Performance Tools": a review of Brendan Gregg's book on BPF-based observability.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Impedance matching for BPF and LSM

By Jake Edge
February 26, 2020

The "kernel runtime security instrumentation" (KRSI) patch set has been making the rounds over the past few months; the idea is to use the Linux security module (LSM) hooks as a way to detect, and potentially deflect, active attacks against a running system. It does so by allowing BPF programs to be attached to the LSM hooks. That has caused some concern in the past about exposing the security hooks as external kernel APIs, which makes them potentially subject to the "don't break user space" edict. But there has been no real objection to the goals of KRSI. The fourth version of the patch set was posted by KP Singh on February 20; the concerns raised this time are about its impact on the LSM infrastructure.

The main change Singh made from the previous version effectively removed KRSI from the standard LSM calling mechanisms by using BPF "fexit" (for function exit) trampolines on the LSM hooks. That trampoline can efficiently call any BPF programs that have been placed on the hook without the overhead associated with the normal LSM path; in particular, it avoids the cost of the retpoline mitigation for the Spectre hardware vulnerability. The KRSI hooks are enabled by static keys, which means they have zero cost when they are not being used. But it does mean that KRSI looks less like a normal LSM as Singh acknowledged: "Since the code does not deal with security_hook_heads anymore, it goes from 'being a BPF LSM' to 'BPF program attachment to LSM hooks'."

Casey Schaufler, who has done a lot of work on the LSM infrastructure over the last few years, objected to making KRSI a special case, however: "You aren't doing anything so special that the general mechanism won't work." Singh agreed that the standard LSM approach would work for KRSI, "but the cost of an indirect call for all the hooks, even those that are completely unused is not really acceptable for KRSI’s use cases".

Kees Cook focused on the performance issue, noting that making calls for each of the hooks, even when there is nothing to do, is common for LSMs. Of the 230 hooks defined in the LSM interface, only SELinux uses more than half (202), Smack is next (108), and the rest use fewer than a hundred—several use four or fewer. He would like to see some numbers on the performance gain from using static keys to disable hooks that are not required. It might make sense to use that mechanism for all of LSM, he said. Singh agreed that it would be useful to have some performance numbers; "I will do some analysis and come back with the numbers." It is, after all, a bit difficult to have a discussion about improving performance without some data to guide the decision-making.

There are several intertwined pieces to the disagreement. The LSM infrastructure has generally not been seen as a performance bottleneck, at least until the advent of KRSI; instead, over recent years, the focus has been on generalizing that infrastructure to support arbitrary stacking of multiple LSMs in a running system. That has improved the performance of handling calls into multiple hooks (when more than one LSM defines one for a given operation) over the previous mechanism, but it was not geared to the high-performance requirements that KRSI is trying to bring to the LSM world.

In addition, the stacking work has made it so that LSMs can be stacked in any order; each defined hook for a given operation is called in order, the first that denies the operation "wins" and the operation fails without calling any others. That infrastructure is in place, but KRSI upends it to a certain extent. KRSI comes from the BPF world, so the list traversal and indirect calls used by the LSM infrastructure are seen as performance bottlenecks. KRSI always places itself (conceptually) last on the list and uses the BPF trampoline to avoid that overhead. That makes it a special case, unlike the other "normal" stackable LSMs, but that may be a case of a premature optimization, as Cook noted.

BPF developer Alexei Starovoitov does not see it as "premature" at all, however. "I'm convinced that avoiding the cost of retpoline in critical path is a requirement for any new infrastructure in the kernel." He thought that the LSM infrastructure should consider using static keys to enable its hooks, and that the mechanism employed by KRSI should be used there:

Just compiling with CONFIG_SECURITY adds "if (hlist_empty)" check for every hook. Some of those hooks are in critical path. This load+cmp can be avoided with static_key optimization. I think it's worth doing.

[...] I really like that KRSI costs absolutely zero when it's not enabled. Attaching BPF prog to one hook preserves zero cost for all other hooks. And when one hook is BPF powered it's using direct call instead of super expensive retpoline.

But the insistence on treating KRSI differently than the other LSMs means that perhaps KRSI should go its own way—or work on improving the LSM infrastructure as a whole. Schaufler said that he had not "gotten that memo" on avoiding retpolines and that the LSM infrastructure is not new. He is interested in looking at using static keys, but is concerned that the mechanism is too specific to use in the general case, where multiple LSMs can register hooks to be called:

I admit to being unfamiliar with the static_key implementation, but if it would work for a list of hooks rather than a single hook, I'm all ears.

The new piece is not KRSI per se, Singh said, but the ability to attach BPF programs to the security hooks is new. There are techniques available to make that have zero cost, so it makes sense to use them:

There are other tracing / attachment [mechanisms] in the kernel which provide similar [guarantees] (using static keys and direct calls) and it seems regressive for KRSI to not leverage these known patterns and sacrifice performance [especially] in hotpaths. This provides users to use KRSI alongside other LSMs without paying extra cost for all the possible hooks.

[...] My analogy here is that if every tracepoint in the kernel were of the type:

if (trace_foo_enabled) // <-- Overhead here, solved with static key
   trace_foo(a);  // <-- retpoline overhead, solved with fexit trampolines

It would be very hard to justify enabling them on a production system, and the same can be said for BPF and KRSI.

The difficulty is that the LSM interface came about under a different set of constraints, Schaufler said. Those constraints have changed over time and the infrastructure is being worked on to improve its performance, but it still needs to work with the existing LSMs:

The LSM mechanism is not zero overhead. It never has been. That's why you can compile it out. You get added value at a price. You get the ability to use SELinux and KRSI together at a price. If that's unacceptable you can go the route of seccomp, which doesn't use LSM for many of the same reasons you're on about.

When LSM was introduced it was expected to be used by the lunatic fringe people with government mandated security requirements. Today it has a much greater general application. That's why I'm in the process of bringing it up to modern use case requirements. Performance is much more important now than it was before the use of LSM became popular.

[...] If BPF and KRSI are that performance critical you shouldn't be tying them to LSM, which is known to have overhead. If you can't accept the LSM overhead, get out of the LSM. Or, help us fix the LSM infrastructure to make its overhead closer to zero. Whether you believe it or not, a lot of work has gone into keeping the LSM overhead as small as possible while remaining sufficiently general to perform its function.

The goal of eliminating the retpoline overhead is reasonable, Cook said, but the LSM world has not yet needed to do so. "I think it's a desirable goal, to be sure, but this does appear to be an early optimization." He noted there is something of an impedance mismatch; the BPF developers do not want to see any performance hit associated with BPF, but the LSM developers "do not want any new special cases in LSM stacking". So he suggested adding a "slow" KRSI that used the LSM stacking infrastructure as it is today, followed by work to optimize that calling path.

But Starovoitov thought that KRSI should perhaps just go its own way. He proposed changing the BPF program type from BPF_PROG_TYPE_LSM to BPF_PROG_TYPE_OVERRIDE_RETURN and moving KRSI completely out of the LSM world: "I don't see anything in LSM infra that KRSI can reuse." He does not see a slow KRSI as an option and suggested that perhaps the LSM interface and the new BPF program type should be made mutually exclusive at kernel build time:

It may seem as a downside that it will force a choice on kernel users. Either they build the kernel with CONFIG_SECURITY and their choice of LSMs or build the kernel with CONFIG_BPF_OVERRIDE_RETURN and use BPF_PROG_TYPE_OVERRIDE_RETURN programs to enforce any kind of policy. I think it's a pro not a con.

There are, of course, lots of users of the LSM interface, including most distributions, so it might be difficult to go that route, Schaufler said. But Singh noted that the users of a BPF_PROG_TYPE_OVERRIDE_RETURN feature may be highly performance-sensitive such that they already disable LSMs "because of the current performance characteristics". But Singh did think that BPF_PROG_TYPE_OVERRIDE_RETURN might be useful on its own, separate from the KRSI work; he plans to split that out into its own patch set. He agreed with Cook's approach, as well, and plans to re-spin the KRSI patches to use the standard LSM approach as a starting point; "we can follow-up on performance".

The clash here was the classic speed versus generality tradeoff that pops up in the kernel (and elsewhere) with some frequency. The BPF developers are laser-focused on their "baby"—and squeezing every last drop of performance out of it—but there is a wider world in kernel-land, some parts of which have different requirements. It would seem that a reasonable compromise has been found here. Preserving the generality of the LSM approach while gaining the performance improvements that the BPF developers have been working on would be a win for both, really—taking the kernel and its users along for the ride.

Comments (21 posted)

Memory-management optimization with DAMON

By Jonathan Corbet
February 20, 2020

To a great extent, memory management is based on making predictions: which pages of memory will a given process need in the near future? Unfortunately, it turns out that predictions are hard, especially when they are about future events. In the absence of useful information sent back from the future, memory-management subsystems are forced to rely on observations of recent behavior and an assumption that said behavior is likely to continue. The kernel's memory-management decisions are opaque to user space, though, and often result in less-than-optimal performance. A pair of patch sets from SeongJae Park tries to make memory-usage patterns visible to user space, and to let user space change memory-management decisions in response.

At the core of this new mechanism is the data access monitor or DAMON, which is intended to provide information on memory-access patterns to user space. Conceptually, its operation is simple; DAMON starts by dividing a process's address space into a number of equally sized regions. It then monitors accesses to each region, providing as its output a histogram of the number of accesses to each region. From that, the consumer of this information (in either user space or the kernel) can request changes to optimize the process's use of memory.

Reality is a bit more complex than that, of course. Current hardware allows for a huge address space, most of which is unused; dividing that space into (for example) 1000 regions could easily result in all of the used address space being pushed into just a couple of regions. So DAMON starts by splitting the address space into three large chunks which are, to a first approximation, the text, heap, and stack areas. Only those areas are monitored for access patterns.

For each region, DAMON tries to track the number of accesses. Watching every page in a region would be expensive, though, and one of the design goals of DAMON is to be efficient enough to run on production workloads. These objectives are reconciled by assuming that all pages in a given region have approximately equal access patterns, so there is no need to watch more than one of them. Thus, within each region, the "accessed" bit on a randomly selected page is cleared, then occasionally checked. If that page has been accessed, then the region is deemed to have been accessed.

It would be nice if a process being monitored would helpfully line up its memory-access patterns to match the regions chosen by DAMON, but such cooperation is rare in real-world systems. So the layout of those equally sized regions is unlikely to correspond well with how memory is actually being used. DAMON attempts to compensate for this by adjusting the regions on the fly as the process executes. Regions showing heavy access patterns are divided into smaller areas, while those seeing little use are coalesced into larger blocks. If all this works well, the result over time should be a zeroing-in on the truly hot areas of the target process's address space.

To control all of this, DAMON creates a set of virtual files in the debugfs filesystem. There is no access control implemented within DAMON itself, but those files are set up for root access only by default. All of the relevant parameters — target process, number of regions, and sampling and aggregation periods — can be configured by writing to those files. The resulting data can be read from debugfs; it is also possible to have the kernel write sampling data directly to a file, from which it can be processed at leisure. As an alternative, users can attach to a tracepoint to receive the data as it is generated; this makes it readily available to the perf tool, among other things.

That data, however it is obtained, is essentially a histogram; each memory region is a bin and the number of hits in that bin is recorded. That data can be analyzed by hand, of course; there is also a sample script that can feed it to gnuplot to present the information in a more graphic form. This information, Park says, can be highly useful:

To see the usefulness of the monitoring, we optimized 9 memory intensive workloads among them for memory pressure situations using the DAMON outputs. In detail, we identified frequently accessed memory regions in each workload based on the DAMON results and protected them with mlock() system calls. The optimized versions consistently show speedup (2.55x in best case, 1.65x in average) under memory pressure.

That kind of speedup certainly justifies spending some time looking at a process's memory patterns. It would be even nicer, though, if the kernel could do that work itself — that is what a memory-management subsystem is supposed to be for, after all. As a step in that direction, Park has posted a separate patch set implementing the "data access monitoring-based memory operation schemes". This mechanism allows users to tell DAMON how to respond to specific sorts of access patterns. This is done through a new debugfs file ("schemes") that accepts lines like:

    min-size max-size min-acc max-acc min-age max-age action

Each rule will apply to regions between min-size and max-size in length with access counts between min-acc and max-acc. These counts must have been accumulated in a region with an age between min-age and max-age. The "age" of a region is reset whenever a significant change happens; this can include the application of an action or a resizing of the region itself.

The action is, at this point, a command to be passed to an madvise() call on the region; supported values are MADV_WILLNEED, MADV_COLD, MADV_PAGEOUT, MADV_HUGEPAGE, and MADV_NOHUGEPAGE. Actions of this type could be used to, for example, explicitly force out a region that sees little use or to request that huge pages be used for hot regions. Comments within the patch set suggest that mlock is also envisioned as an action, but that is not currently implemented.

A mechanism like this has clear value when it comes to helping developers tune the memory-management subsystem for their workloads. It raises an interesting question, though: given that the kernel can be made to tune itself for better memory-management results, why isn't this capability a part of the memory-management subsystem itself? Bolting it on as a separate module might be useful for memory-management developers, who are likely interested in trying out various ideas. But one might well argue that production systems should Just Work without the need for this sort of manual tweaking, even if the tweaking is supported by a capable monitoring system. While DAMON looks like a useful tool now, users may be forgiven for hoping that it makes itself obsolete over time.

Comments (9 posted)

CAP_PERFMON — and new capabilities in general

By Jonathan Corbet
February 21, 2020

The perf_event_open() system call is a complicated beast, requiring a fair amount of study to master. This call also has some interesting security implications: it can be used to obtain a lot of information about the running system, and the complexity of the underlying implementation has made it more than usually prone to unpleasant bugs. In current kernels, the security controls around perf_event_open() are simple, though: if you have the CAP_SYS_ADMIN capability, perf_event_open() is available to you (though the system administrator can make it available without any privilege at all). Some current work to create a new capability for the perf events subsystem would seem to make sense, raising the question of why adding new capabilities isn't done more often.

Capabilities are a longstanding effort to split apart the traditional Unix superuser's powers into something more fine-grained, allowing administrators to give limited privileges where needed without making the recipients into full superusers. There are 37 capabilities defined in current Linux kernels, controlling the ability to carry out a range of tasks including configuring terminal devices, overriding resource limits, installing kernel modules, or adjusting the system time. Among these capabilities, though, is CAP_SYS_ADMIN, nominally the capability needed to perform system-administration tasks. CAP_SYS_ADMIN has become the default capability to require when nothing else seems to fit; it enables so many actions that it has long been known as "the new root".

A quick check shows well over 500 checks for CAP_SYS_ADMIN in the 5.6-rc kernel. During the 5.6 merge window, new checks were added to allow holders of CAP_SYS_ADMIN to send hardware-specific commands to obscure devices, configure time namespaces, load BPF programs for kernel operations structures, and open access to x86 MTRR registers. The perf events subsystem has also come to rely on CAP_SYS_ADMIN to keep unprivileged users out. As a result, to enable a user to call perf_event_open(), an administrator must also allow that user to mount filesystems, access PCI configuration spaces, tune memory-management policies, load BPF programs, and more. That is a lot of privilege to associate with a task that ordinary users are fairly likely to legitimately need to do.

This patch set from Alexey Budankov addresses that problem by creating a new capability called CAP_PERFMON to govern performance-monitoring tasks. With this patch installed, users (or their programs) could be granted CAP_PERFMON rather than CAP_SYS_ADMIN, enabling them to get performance data without adding all those other powers. Of course, CAP_SYS_ADMIN would still be sufficient to call perf_event_open(); otherwise the chances of breaking existing systems are high. But it would no longer be necessary if a user has CAP_PERFMON instead.

At a first look, this change seems relatively obvious; it is hard to complain about separating out a relatively constrained, low-danger activity from a powerful capability. But it does lead one to wonder why this kind of change is done so rarely. The last time a new capability was added was in 2014, when CAP_AUDIT_READ joined the set. It would appear that the last time a capability was split out of CAP_SYS_ADMIN was the creation of CAP_SYSLOG in 2010. Once something becomes part of CAP_SYS_ADMIN, it seems, it stays there. Why might that be the case?

One reason, of course, is the aforementioned compatibility issue: once CAP_SYS_ADMIN allows an action, it can never lose that power without possibly breaking existing systems. When Serge Hallyn added CAP_SYSLOG, he added the usual code that made things continue to work if the process in question had CAP_SYS_ADMIN. In that case, though, the kernel issues a warning that use of CAP_SYS_ADMIN for these operations is deprecated. Nearly ten years later, the compatibility code — and the warning — remain. Splitting capabilities out of CAP_SYS_ADMIN is less than fully rewarding when the power of CAP_SYS_ADMIN itself can never be reduced.

Adding capabilities has hazards of its own, in that existing code will know nothing about a new capability and what it might control. A program that clears bits out of a capability mask is likely to clear the new one, but that capability might be needed going forward. Experience has shown that running a privileged program with selectively removed capabilities can open up surprising vulnerabilities; every new capability potentially creates just that sort of situation. So capabilities must be added with care. There is a reason why the SELinux build has a check that explicitly fails if new capabilities have been added without corresponding changes in SELinux itself.

Then, there is the unfortunate fact that capabilities in Linux are seen by many as a failed experiment. Nobody has ever made a practical, fully capability-based system using them, and many of the defined capabilities are relatively easily escalated to full root powers. Linux systems above the kernel level have made limited use of them, if indeed capabilities have been used at all. It can be hard to generate enthusiasm for refining a system that can never work as was originally intended and which may never be used in any serious way.

As an example, one obvious way to use capabilities to reduce privilege would be to remove the setuid bit on existing utilities and install just the needed capabilities instead. The kernel has supported file-based capabilities since the 2.6.24 release in 2008 after all. Your editor's current system, running Fedora 31 (which includes "first" among its goals) contains a grand total of nine binaries with capabilities attached:

    # getcap -r /
    /usr/bin/gnome-keyring-daemon = cap_ipc_lock+ep
    /usr/bin/clockdiff = cap_net_raw+p
    /usr/bin/arping = cap_net_raw+p
    /usr/bin/newuidmap = cap_setuid+ep
    /usr/bin/newgidmap = cap_setgid+ep
    /usr/bin/ping = cap_net_admin,cap_net_raw+p
    /usr/bin/gnome-shell = cap_sys_nice+ep
    /usr/sbin/mtr-packet = cap_net_raw+ep
    /usr/sbin/suexec = cap_setgid,cap_setuid+ep

It is good to know that gnome-shell does not run setuid root, so capabilities have brought some value here. But that compares with 31 setuid root binaries; it would appear that there is no prospect of this distribution becoming capability-only anytime soon.

That said, there are signs of a shift with regard to capabilities. The never-ending desire to harden our systems against attacks is driving developers to take another look at Linux capabilities and how they might help. The Android system makes use of capabilities, for example. Systemd gives administrators extensive control over the capabilities granted to running programs. It may just be that, after many years of disuse, Linux capabilities are finally finding a place in deployed systems.

If that is the case, we may well see a renewed level of interest in increasing the granularity of the permissions controlled by capabilities. That could include splitting more powers out of CAP_SYS_ADMIN; as noted above, though, that must be done carefully. CAP_SYS_ADMIN is unlikely to stop being the not-so-new root anytime soon, but perhaps it could be made into a capability that few programs need to have to get their work done.

Comments (23 posted)

watch_mount(), watch_sb(), and fsinfo() (again)

By Jonathan Corbet
February 24, 2020

Filesystems, by design, hide a lot of complexity from users. At times, though, those users need to be able to look inside the black box and extract information about what is going on within a filesystem. Answering this need is David Howells, the creator of a number of filesystem-oriented system calls; in this patch set he tries to add three more, one of which we have seen before and two of which are new.

The new system calls, watch_mount() and watch_sb(), provide ways for a process to request notifications whenever something changes at either a mount point (watch_mount()) or within a specific mounted filesystem (watch_sb(), the "sb" standing for "superblock"). For a mount point, events of interest can include the mounting or unmounting of filesystems anywhere below the mount point, the change of an attribute like read-only, movement of mount points, and more. Filesystem-specific events can also include attribute changes, along with filesystem errors, quota problems, or network issues for remote filesystems.

These system calls are built on a newer version of the event-notification mechanism that Howells has been working on for some time. In the past, getting notifications has involved opening a new device (/dev/watch_queue), but that interface has changed in the meantime. In the current version, a process calls pipe2() with the new O_NOTIFICATION_PIPE flag to create a special type of pipe meant for notification use. The writable side of this pipe is not used by the application; the file descriptor for the readable end can be passed to either of the new system calls:

    int watch_mount(int dirfd, const char *path, unsigned int flags,
    		    int watch_fd, int watch_id);
    int watch_sb(int dirfd, const char *path, unsigned int flags,
    		 int watch_fd, int watch_id);

In both cases, dirfd, path, and flags identify the directory of interest in the usual openat() style. The notification pipe is passed in as watch_fd, and watch_id is an integer value that will be returned in any generated events. There is a special case, though; if watch_id is -1, any existing watch using the given watch_fd will be removed.

The application receives events by reading from the pipe. By default all events affecting the given watch point will be returned. The application can, though, create a filter that is attached to the notification pipe with an ioctl() call. There's another ioctl() call to set the size of the buffer used to hold notifications sent to user space. Curious readers can see these system calls used in this sample program.

Unlike the system calls described above, fsinfo() has been seen before. Its prototype remains the same:

    int fsinfo(int dirfd, const char *path, const struct fsinfo_params *params,
	       void *buffer, size_t buf_size);

As before, dirfd and path describe the filesystem for which information is requested; there is no flags argument here, but it is hidden within the params structure, which looks like this:

    struct fsinfo_params {
	__u32	at_flags;
	__u32	flags;
	__u32	request;
	__u32	Nth;
	__u32	Mth;
	__u64	__reserved[3];
    };

The at_flags field contains the same flags that one would ordinarily expect to see in an openat()-style system call. The request field describes the information that is being asked for; a number of possible values can be found in this patch from the series. Potentially available information ("potentially" because filesystems are not required to implement every possibility) includes filesystem limits, timestamp resolution information, the volume UUID, the servers behind a remote filesystem, and more. For attributes that can have multiple values, the Nth and Mth fields can be used to select one in particular.

The format of the returned value is ... complex. Values are stored into the provided buffer in any of a number of formats, depending on what was requested. For some, a structure is returned; others return a string or a type called simply "opaque". There is some documentation in this patch, but it seems clear that potential users of this system call will have to do some digging to figure out the information that will be returned to them.

This patch set is now in its 17th revision, having evolved quite a bit over the years. The one comment on this version, so far, comes from James Bottomley, who suggested that there may not be a need for fsinfo() at all. Instead, with some changes to how fsconfig() (which is used to configure filesystem attributes) is implemented, it could be turned into an interface that could both set and read attributes. So far, Howells has not responded to that suggestion.

Overall, the fact that these patches have been through 17 revisions (so far) says a lot. Nobody doubts that getting this information out of the kernel would be useful, but the API remains complex and hard for potential users to understand. Whether that can be fixed while retaining the features provided by these system calls is not clear, though.

Comments (13 posted)

A look at "BPF Performance Tools"

By Jake Edge
February 26, 2020

BPF has exploded within the Linux world over the last few years, growing from its networking roots into the go-to tool for running custom in-kernel programs. Its role seems to expand with every kernel release into diverse areas such as security and device control. But none of that is the focus of a relatively new book from Brendan Gregg, BPF Performance Tools; it looks, instead, at how BPF provides visibility into the guts of the kernel. Finding performance bottlenecks of various sorts on (generally large) production systems is an area where BPF and the tool set that has grown up around it can excel; Gregg's book describes that landscape in great depth.


The book is meant to be both a way to learn about what BPF can do to improve the observability of Linux systems and applications and a reference guide to a large body of tools that Gregg and others have built up to peer inside the running system. Interestingly, it does not actually cover the underlying BPF virtual-machine instructions all that much, except in an appendix; the focus is on how to use BPF at a higher level. Even then, learning to actually write tools using the high-level environments (BCC and bpftrace) is not truly the intent either, though code samples for bpftrace abound. The book is definitely geared toward finding problems at multiple levels on Linux systems running in production.

It begins by introducing BPF, noting its origin as the Berkeley Packet Filter and its eventual upgrade to extended BPF (eBPF), before giving a quick overview of the tracing and sampling techniques available on Linux. It then gives a taste of what the BPF Compiler Collection (better known as BCC) can actually do by using canned tools to examine system-wide execve() calls and block I/O latency. The different levels of tracing available in a Linux system, from applications through system libraries and the system-call interface down to internal kernel tracepoints and hardware counters, are briefly described with an eye toward a few bpftrace "one-liners" to examine open() and openat() system calls. Examples of bpftrace one-liners (and more) can be found in Gregg's LWN article from July 2019 and a report on his talk at the 2019 Linux Storage, Filesystem, and Memory-Management Summit.

That first chapter would be useful to anyone who is curious what the BPF fuss is all about. The concepts introduced in the first chapter (and more) are spelled out in greater detail in the rest of Part I ("Technologies"). The book is meant to be read straight through, if desired, or simply used as a reference for the tools and techniques that can be used to track down problems in a system. That leads to a bit of repetition here and there, so that readers who jump to a particular section will not be completely lost. It can be a little irritating for those reading straight through, but it is probably unavoidable in a dual-purpose book like this.

BPF itself is a complicated beast, which hooks into a wide variety of facilities for gathering tracing information. That includes both static options (kernel tracepoints and user-level statically defined tracing (USDT) markers) and ways to insert dynamic instrumentation into the kernel (kprobes) or user-space programs (uprobes). BPF programs can be used to collect information from those sources (and others like hardware performance monitoring counters (PMCs) and perf_events), summarize it in-kernel, and display the results in a variety of forms. Chapter 2 describes all of them in some detail.
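As a rough illustration of those attachment points, here are generic bpftrace invocations for each (these are not from the book; probe availability and the libc path shown are assumptions that depend on the kernel and distribution in question):

```
# Static: a kernel tracepoint (stable across kernel versions)
bpftrace -e 'tracepoint:block:block_rq_issue { @ = count(); }'

# Dynamic: a kprobe on a kernel function (may change between versions)
bpftrace -e 'kprobe:vfs_read { @[comm] = count(); }'

# Dynamic: a uprobe on a user-space library function (path is system-dependent)
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @[pid] = count(); }'
```

The tradeoff is roughly stability versus coverage: tracepoints and USDT markers are maintained interfaces, while kprobes and uprobes can instrument almost anything but may break as the underlying code changes.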

One of the key advantages of BPF over other tracing techniques is that it does its work efficiently in the kernel and can simply present its results; many other tools require storing lots of information in memory or log files and then post-processing it to actually pull out the data of interest. Some also require adding code to the kernel, either by rebuilding it with a different configuration or by adding a kernel module; BPF dispenses with all of that. In addition, BPF has data structures and helper functions to collect the kinds of information that might be of interest (e.g. stack traces); descriptions of all of that are gathered up in Chapter 2 as well.
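In-kernel summarization is easy to see in a bpftrace histogram one-liner (again a generic example, not from the book; it requires root and a bpftrace installation):

```
# Summarize successful read() return sizes as a power-of-two histogram;
# only the summary map, not per-event data, is copied to user space
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes = hist(args->ret); }'
```

A tool like strace would instead emit a line for every read() call, which then has to be stored and post-processed; here the aggregation happens at event time, in the kernel.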

While using BPF is the focus of the book, Gregg does not ignore the other tools available for diagnosing problems. The chapter on the process of analyzing a system starts with a look at the goals and methodologies that can be used to narrow things down. There are two separate checklists that are presented as starting points. The first uses standard Linux tools (e.g. vmstat, pidstat, and sar) in a "60-second checklist", before moving into a checklist of BCC tools (e.g. execsnoop, biosnoop, and tcpaccept). Each of the entries on the checklists is described along with how the output can be useful in pinpointing where problems might be; the BCC tool descriptions also reference other parts of the book where they are described in even more detail.
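For reference, the traditional-tools portion of that checklist, as Gregg has published it elsewhere, runs roughly as follows (run from a shell on the target system; exact tool availability varies by distribution, with sar, mpstat, pidstat, and iostat typically coming from the sysstat package):

```
uptime                 # load averages
dmesg | tail           # recent kernel messages and errors
vmstat 1               # system-wide memory, swap, and CPU
mpstat -P ALL 1        # per-CPU balance
pidstat 1              # per-process CPU usage
iostat -xz 1           # disk I/O utilization and latency
free -m                # memory usage
sar -n DEV 1           # network interface throughput
sar -n TCP,ETCP 1      # TCP connections and retransmits
top                    # overall check
```

The idea is to spend about a minute scanning these before reaching for deeper BPF-based tools.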

Rounding out Part I are a chapter each for BCC and bpftrace covering their installation, internal operation, and how they can be used; each chapter has multiple examples of them in action. These days, many Linux distributions provide packages for both of these interfaces, including the tools developed using them. There is also a large set of tools that Gregg developed specifically for the book, which can be seen in the diagram below in red; the existing tools are shown in black. All of the new tools can be found in his GitHub repository.

[Tools diagram]

While the first part of the book gives a lot of useful context and a large, tasty bite of what BPF can do, the meat of the book is contained in Part II ("Using BPF Tools"). There are 11 separate chapters, each looking at a different area of the system with an eye toward how to use the tools and bpftrace one-liners to dig into the operation of that area. For example, there are chapters covering the CPU, memory, I/O, networking, security, applications, languages (e.g. Java), containers, and hypervisors.

Each chapter gives some background information to help understand the role of the area covered in the chapter; it also describes aspects of it that might lead to performance or other problems. The traditional tools for investigating problems are introduced with examples given of the kinds of information they can provide. The chapters then move into BPF tools and bpftrace programs (or one-liners) that can be used for troubleshooting and pinpointing problem areas. Many of the chapters have an "Optional Exercises" section with ideas for ways to extend the existing tools or write new ones either using BCC or bpftrace; the ones marked "advanced, unsolved" are, of course, particularly challenging.

The remaining parts of the book are supplemental material at some level. Part III ("Additional Topics") has a chapter on other BPF-based performance-analysis tools and one on "Tips, Tricks, and Common Problems". The final part is appendixes, including a list of all the one-liners used in Part II, a bpftrace cheat sheet, information on developing BCC-based and C-based BPF programs, and a reference on the BPF instruction set. That is followed by a glossary, bibliography, and an index.

I have a couple of nits to pick with the book, but overall it is excellent, with comprehensive coverage of BPF-based tools and how to use them for investigating performance and other problems. The book can be a bit overwhelming at times, but that is really due to the subject matter at hand; there are lots of parts and pieces in the BPF landscape, so trying to keep them all straight can be a challenge.

I got a review copy of the EPUB version of the book from the publisher that I read in two different ways: on a tablet using Lithium and on my desktop with calibre; I did not try it on my Kobo E Ink reader as the layout of the book did not seem conducive to a small, monochrome screen. I encountered some rendering problems in Lithium, which I have used successfully with other technically oriented books: examples and tool output that spanned page boundaries on the screen would not display the portion on the next page. But there were links that would take you to a full-page rendering of the item, which could then be tapped to return to the right place. Calibre did not have that flaw, and presumably other EPUB readers would not either, but it was obviously not annoying enough for me to go search out another reader.

The book has quite a number of in-line footnotes, which are useful; they often highlight the history and developer behind a particular tool. But the use of square-bracket-style links in the text left something to be desired. Clicking (or tapping) those would take you to an entry in the list after the bibliography, but each listed item was itself simply a link to a web URL. A way to go directly to the linked-to item would have made navigation easier. Obviously, a dead-tree version of the book would not suffer from that problem, but paging to the list might be a bit painful as well. Perhaps newer editions could simply use regular footnotes for the links, making them directly selectable in electronic copies and saving the paging on paper copies.

While the book focuses on performance problems on "big iron"—many of the examples show output from 48-CPU systems—the techniques and tools will be useful for a wide variety of other environments. Tracking down bugs on a desktop system or gaining familiarity with the internals of the kernel are just two of the possibilities that the book helps unlock. Nearly anyone running Linux will find a bpftrace one-liner (or three) that will pique their curiosity. BPF Performance Tools is definitely worth a look for anyone curious about the workings of their Linux systems.

Comments (6 posted)

Page editor: Jonathan Corbet


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds