Leading items
Welcome to the LWN.net Weekly Edition for February 27, 2020
This edition contains the following feature content:
- Impedance matching for BPF and LSM: can the Linux security module interface be adapted to BPF's needs?
- Memory-management optimization with DAMON: a data-access monitor that can report on patterns in memory use for better tuning of the memory-management subsystem.
- CAP_PERFMON — and new capabilities in general: a new Linux capability for perf_event_open() leads to the question of why new capabilities are such a rare occurrence.
- watch_mount(), watch_sb(), and fsinfo() (again): three new system calls proposed for extracting information about filesystems.
- A look at "BPF Performance Tools": a review of the recent book on using BPF tools for performance analysis.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Impedance matching for BPF and LSM
The "kernel runtime security instrumentation" (KRSI) patch set has been making the rounds over the past few months; the idea is to use the Linux security module (LSM) hooks as a way to detect, and potentially deflect, active attacks against a running system. It does so by allowing BPF programs to be attached to the LSM hooks. That has caused some concern in the past about exposing the security hooks as external kernel APIs, which makes them potentially subject to the "don't break user space" edict. But there has been no real objection to the goals of KRSI. The fourth version of the patch set was posted by KP Singh on February 20; the concerns raised this time are about its impact on the LSM infrastructure.
The main change Singh made from the previous version effectively removed KRSI from the standard LSM calling mechanisms by using BPF "fexit" (for function exit) trampolines on the LSM hooks. That trampoline can efficiently call any BPF programs that have been placed on the hook without the overhead associated with the normal LSM path; in particular, it avoids the cost of the retpoline mitigation for the Spectre hardware vulnerability. The KRSI hooks are enabled by static keys, which means they have zero cost when they are not being used. But it does mean that KRSI looks less like a normal LSM, as Singh acknowledged: "Since the code does not deal with security_hook_heads anymore, it goes from 'being a BPF LSM' to 'BPF program attachment to LSM hooks'."
Casey Schaufler, who has done a lot of work on the LSM infrastructure over the last few years, objected to making KRSI a special case, however: "You aren't doing anything so special that the general mechanism won't work." Singh agreed that the standard LSM approach would work for KRSI, "but the cost of an indirect call for all the hooks, even those that are completely unused is not really acceptable for KRSI’s use cases".
Kees Cook focused on the performance issue, noting that making calls for each of the hooks, even when there is nothing to do, is common for LSMs. Of the 230 hooks defined in the LSM interface, only SELinux uses more than half (202); Smack is next (108), and the rest use fewer than a hundred; several use four or fewer. He would like to see some numbers on the performance gain from using static keys to disable hooks that are not required. It might make sense to use that mechanism for all of LSM, he said.
Singh agreed that it would be useful to have some performance numbers: "I will do some analysis and come back with the numbers." It is, after all, a bit difficult to have a discussion about improving performance without some data to guide the decision-making.
There are several intertwined pieces to the disagreement. The LSM infrastructure has generally not been seen as a performance bottleneck, at least until the advent of KRSI; instead, over recent years, the focus has been on generalizing that infrastructure to support arbitrary stacking of multiple LSMs in a running system. That has improved the performance of handling calls into multiple hooks (when more than one LSM defines one for a given operation) over the previous mechanism, but it was not geared to the high-performance requirements that KRSI is trying to bring to the LSM world.
In addition, the stacking work has made it so that LSMs can be stacked in any order; each defined hook for a given operation is called in order, the first that denies the operation "wins" and the operation fails without calling any others. That infrastructure is in place, but KRSI upends it to a certain extent. KRSI comes from the BPF world, so the list traversal and indirect calls used by the LSM infrastructure are seen as performance bottlenecks. KRSI always places itself (conceptually) last on the list and uses the BPF trampoline to avoid that overhead. That makes it a special case, unlike the other "normal" stackable LSMs, but that may be a case of a premature optimization, as Cook noted.
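For reference, the generic path that KRSI bypasses looks roughly like the sketch below, which is modeled on (and simplified from) the kernel's call_int_hook() machinery; each hook invocation is an indirect call, which is where the retpoline cost comes from:

    /*
     * Simplified sketch of the stacked-LSM calling convention: walk the
     * list of hooks registered for an operation and stop at the first
     * denial.
     */
    #include <linux/lsm_hooks.h>

    static int security_file_open_sketch(struct file *file)
    {
            struct security_hook_list *p;
            int rc = 0;

            hlist_for_each_entry(p, &security_hook_heads.file_open, list) {
                    rc = p->hook.file_open(file);   /* indirect call */
                    if (rc != 0)
                            break;                  /* first denial wins */
            }
            return rc;
    }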
BPF developer Alexei Starovoitov does not see it as "premature" at all, however. "I'm convinced that avoiding the cost of retpoline in critical path is a requirement for any new infrastructure in the kernel." He thought that the LSM infrastructure should consider using static keys to enable its hooks, and that the mechanism employed by KRSI should be used there:
[...] I really like that KRSI costs absolutely zero when it's not enabled. Attaching BPF prog to one hook preserves zero cost for all other hooks. And when one hook is BPF powered it's using direct call instead of super expensive retpoline.
But the insistence on treating KRSI differently than the other LSMs means that perhaps KRSI should go its own way, or else work on improving the LSM infrastructure as a whole. Schaufler said that he had not "gotten that memo" on avoiding retpolines and that the LSM infrastructure is not new. He is interested in looking at using static keys, but is concerned that the mechanism is too specific to use in the general case, where multiple LSMs can register hooks to be called.
The new piece is not KRSI per se, Singh said; what is new is the ability to attach BPF programs to the security hooks. There are techniques available to make that attachment have zero cost, so it makes sense to use them:
[...] My analogy here is that if every tracepoint in the kernel were of the type:
    if (trace_foo_enabled) // <-- Overhead here, solved with static key
        trace_foo(a);      // <-- retpoline overhead, solved with fexit trampolines
It would be very hard to justify enabling them on a production system, and the same can be said for BPF and KRSI.
The difficulty is that the LSM interface came about under a different set of constraints, Schaufler said. Those constraints have changed over time and the infrastructure is being worked on to improve its performance, but it still needs to work with the existing LSMs:
When LSM was introduced it was expected to be used by the lunatic fringe people with government mandated security requirements. Today it has a much greater general application. That's why I'm in the process of bringing it up to modern use case requirements. Performance is much more important now than it was before the use of LSM became popular.
[...] If BPF and KRSI are that performance critical you shouldn't be tying them to LSM, which is known to have overhead. If you can't accept the LSM overhead, get out of the LSM. Or, help us fix the LSM infrastructure to make its overhead closer to zero. Whether you believe it or not, a lot of work has gone into keeping the LSM overhead as small as possible while remaining sufficiently general to perform its function.
The goal of eliminating the retpoline overhead is reasonable, Cook said, but the LSM world has not yet needed to do so. "I think it's a desirable goal, to be sure, but this does appear to be an early optimization." He noted there is something of an impedance mismatch: the BPF developers do not want to see any performance hit associated with BPF, but the LSM developers "do not want any new special cases in LSM stacking". So he suggested adding a "slow" KRSI that used the LSM stacking infrastructure as it is today, followed by work to optimize that calling path.
But Starovoitov thought that KRSI should perhaps just go its own way. He proposed changing the BPF program type from BPF_PROG_TYPE_LSM to BPF_PROG_TYPE_OVERRIDE_RETURN and moving KRSI completely out of the LSM world: "I don't see anything in LSM infra that KRSI can reuse." He does not see a slow KRSI as an option and suggested that perhaps the LSM interface and the new BPF program type should be made mutually exclusive at kernel build time.
There are, of course, lots of users of the LSM interface, including most distributions, so it might be difficult to go that route, Schaufler said. But Singh noted that the users of a BPF_PROG_TYPE_OVERRIDE_RETURN feature may be highly performance-sensitive, such that they already disable LSMs "because of the current performance characteristics".
But Singh did think that BPF_PROG_TYPE_OVERRIDE_RETURN might be useful on its own, separate from the KRSI work; he plans to split that out into its own patch set. He agreed with Cook's approach as well, and plans to re-spin the KRSI patches to use the standard LSM approach as a starting point; "we can follow-up on performance".
The clash here was the classic speed versus generality tradeoff that pops up in the kernel (and elsewhere) with some frequency. The BPF developers are laser-focused on their "baby"—and squeezing every last drop of performance out of it—but there is a wider world in kernel-land, some parts of which have different requirements. It would seem that a reasonable compromise has been found here. Preserving the generality of the LSM approach while gaining the performance improvements that the BPF developers have been working on would be a win for both, really—taking the kernel and its users along for the ride.
Memory-management optimization with DAMON
To a great extent, memory management is based on making predictions: which pages of memory will a given process need in the near future? Unfortunately, it turns out that predictions are hard, especially when they are about future events. In the absence of useful information sent back from the future, memory-management subsystems are forced to rely on observations of recent behavior and an assumption that said behavior is likely to continue. The kernel's memory-management decisions are opaque to user space, though, and often result in less-than-optimal performance. A pair of patch sets from SeongJae Park tries to make memory-usage patterns visible to user space, and to let user space change memory-management decisions in response.
At the core of this new mechanism is the data access monitor, or DAMON, which is intended to provide information on memory-access patterns to user space. Conceptually, its operation is simple; DAMON starts by dividing a process's address space into a number of equally sized regions. It then monitors accesses to each region, providing as its output a histogram of the number of accesses to each region. From that, the consumer of this information (in either user space or the kernel) can request changes to optimize the process's use of memory.
Reality is a bit more complex than that, of course. Current hardware allows for a huge address space, most of which is unused; dividing that space into (for example) 1000 regions could easily result in all of the used address space being pushed into just a couple of regions. So DAMON starts by splitting the address space into three large chunks which are, to a first approximation, the text, heap, and stack areas. Only those areas are monitored for access patterns.
For each region, DAMON tries to track the number of accesses. Watching every page in a region would be expensive, though, and one of the design goals of DAMON is to be efficient enough to run on production workloads. These objectives are reconciled by assuming that all pages in a given region have approximately equal access patterns, so there is no need to watch more than one of them. Thus, within each region, the "accessed" bit on a randomly selected page is cleared, then occasionally checked. If that page has been accessed, then the region is deemed to have been accessed.
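In rough kernel-C terms, the per-region sampling step might look something like the sketch below; the structure layout and the accessed-bit helpers are hypothetical stand-ins for the page-table walking the patch set actually does:

    /* Sketch only: one sampled page stands in for the whole region. */
    #include <linux/mm_types.h>
    #include <linux/random.h>

    struct damon_region {
            unsigned long start, end;       /* region bounds */
            unsigned long sample_addr;      /* page chosen for sampling */
            unsigned int nr_accesses;       /* result: the histogram bin */
    };

    static void damon_sample_region(struct mm_struct *mm,
                                    struct damon_region *r)
    {
            /* Was the chosen page touched since the last check? */
            if (page_accessed_and_clear(mm, r->sample_addr))  /* hypothetical */
                    r->nr_accesses++;

            /* Pick a new random page for the next sampling interval
             * (page alignment omitted for brevity). */
            r->sample_addr = r->start +
                    (get_random_long() % (r->end - r->start));
            clear_accessed_bit(mm, r->sample_addr);           /* hypothetical */
    }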
It would be nice if a process being monitored would helpfully line up its memory-access patterns to match the regions chosen by DAMON, but such cooperation is rare in real-world systems. So the layout of those equally sized regions is unlikely to correspond well with how memory is actually being used. DAMON attempts to compensate for this by adjusting the regions on the fly as the process executes. Regions showing heavy access patterns are divided into smaller areas, while those seeing little use are coalesced into larger blocks. If all this works well, the result over time should be a zeroing-in on the truly hot areas of the target process's address space.
To control all of this, DAMON creates a set of virtual files in the debugfs filesystem. There is no access control implemented within DAMON itself, but those files are set up for root access only by default. All of the relevant parameters — target process, number of regions, and sampling and aggregation periods — can be configured by writing to those files. The resulting data can be read from debugfs; it is also possible to have the kernel write sampling data directly to a file, from which it can be processed at leisure. As an alternative, users can attach to a tracepoint to receive the data as it is generated; this makes it readily available to the perf tool, among other things.
That data, however it is obtained, is essentially a histogram; each memory region is a bin, and the number of hits in that bin is recorded. It can be analyzed by hand, of course; there is also a sample script that can feed it to gnuplot to present the information in a more graphic form. This information, Park says, can be highly useful; he cites significant speedups obtained by tuning workloads in response to the access patterns it reveals.
That kind of speedup certainly justifies spending some time looking at a process's memory patterns. It would be even nicer, though, if the kernel could do that work itself — that is what a memory-management subsystem is supposed to be for, after all. As a step in that direction, Park has posted a separate patch set implementing the "data access monitoring-based memory operation schemes". This mechanism allows users to tell DAMON how to respond to specific sorts of access patterns. This is done through a new debugfs file ("schemes") that accepts lines like:
min-size max-size min-acc max-acc min-age max-age action
Each rule will apply to regions between min-size and max-size in length with access counts between min-acc and max-acc. These counts must have been accumulated in a region with an age between min-age and max-age. The "age" of a region is reset whenever a significant change happens; this can include the application of an action or a resizing of the region itself.
The action is, at this point, a command to be passed to an madvise() call on the region; supported values are MADV_WILLNEED, MADV_COLD, MADV_PAGEOUT, MADV_HUGEPAGE, and MADV_NOHUGEPAGE. Actions of this type could be used to, for example, explicitly force out a region that sees little use or to request that huge pages be used for hot regions. Comments within the patch set suggest that mlock is also envisioned as an action, but that is not currently implemented.
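Internally, each such line presumably parses into something like the following sketch; the structure and field names here are illustrative rather than taken from the patch set:

    /* Illustrative only: one parsed "schemes" rule. */
    struct damos_rule_sketch {
            unsigned long min_size, max_size;   /* region size bounds */
            unsigned int  min_acc,  max_acc;    /* access-count bounds */
            unsigned int  min_age,  max_age;    /* age bounds */
            int           action;               /* e.g. MADV_PAGEOUT */
    };

    /* Apply the action only when a region falls inside all three ranges. */
    static bool rule_matches(const struct damos_rule_sketch *rule,
                             unsigned long size, unsigned int accesses,
                             unsigned int age)
    {
            return size >= rule->min_size && size <= rule->max_size &&
                   accesses >= rule->min_acc && accesses <= rule->max_acc &&
                   age >= rule->min_age && age <= rule->max_age;
    }

A rule asking that large, old, rarely accessed regions be paged out, for example, would set a high min-size, a low max-acc, and a high min-age, with MADV_PAGEOUT as the action.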
A mechanism like this has clear value when it comes to helping developers tune the memory-management subsystem for their workloads. It raises an interesting question, though: given that the kernel can be made to tune itself for better memory-management results, why isn't this capability a part of the memory-management subsystem itself? Bolting it on as a separate module might be useful for memory-management developers, who are likely interested in trying out various ideas. But one might well argue that production systems should Just Work without the need for this sort of manual tweaking, even if the tweaking is supported by a capable monitoring system. While DAMON looks like a useful tool now, users may be forgiven for hoping that it makes itself obsolete over time.
CAP_PERFMON — and new capabilities in general
The perf_event_open() system call is a complicated beast, requiring a fair amount of study to master. This call also has some interesting security implications: it can be used to obtain a lot of information about the running system, and the complexity of the underlying implementation has made it more than usually prone to unpleasant bugs. In current kernels, the security controls around perf_event_open() are simple, though: if you have the CAP_SYS_ADMIN capability, perf_event_open() is available to you (though the system administrator can make it available without any privilege at all). Some current work to create a new capability for the perf events subsystem would seem to make sense, raising the question of why adding new capabilities isn't done more often.
Capabilities are a longstanding effort to split apart the traditional Unix superuser's powers into something more fine-grained, allowing administrators to give limited privileges where needed without making the recipients into full superusers. There are 37 capabilities defined in current Linux kernels, controlling the ability to carry out a range of tasks including configuring terminal devices, overriding resource limits, installing kernel modules, and adjusting the system time. Among these capabilities, though, is CAP_SYS_ADMIN, nominally the capability needed to perform system-administration tasks. CAP_SYS_ADMIN has become the default capability to require when nothing else seems to fit; it enables so many actions that it has long been known as "the new root".
A quick check shows well over 500 checks for CAP_SYS_ADMIN in the 5.6-rc kernel. During the 5.6 merge window, new checks were added to allow holders of CAP_SYS_ADMIN to send hardware-specific commands to obscure devices, configure time namespaces, load BPF programs for kernel operations structures, and open access to x86 MTRR registers. The perf events subsystem has also come to rely on CAP_SYS_ADMIN to keep unprivileged users out. As a result, to enable a user to call perf_event_open(), an administrator must also allow that user to mount filesystems, access PCI configuration spaces, tune memory-management policies, load BPF programs, and more. That is a lot of privilege to associate with a task that ordinary users are fairly likely to legitimately need to do.
This patch set from Alexey Budankov addresses that problem by creating a new capability called CAP_PERFMON to govern performance-monitoring tasks. With this patch installed, users (or their programs) could be granted CAP_PERFMON rather than CAP_SYS_ADMIN, enabling them to get performance data without adding all those other powers. Of course, CAP_SYS_ADMIN would still be sufficient to call perf_event_open(); otherwise the chances of breaking existing systems are high. But it would no longer be necessary if a user has CAP_PERFMON instead.
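The resulting permission check is simple; it presumably boils down to something like this sketch (the perfmon_capable() name follows the pattern the kernel later adopted, but treat the code as illustrative):

    /* Sketch: CAP_PERFMON suffices, but CAP_SYS_ADMIN keeps working so
     * that existing setups do not break. */
    #include <linux/capability.h>

    static inline bool perfmon_capable(void)
    {
            return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
    }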
At a first look, this change seems relatively obvious; it is hard to complain about separating out a relatively constrained, low-danger activity from a powerful capability. But it does lead one to wonder why this kind of change is done so rarely. The last time a new capability was added was in 2014, when CAP_AUDIT_READ joined the set. It would appear that the last time a capability was split out of CAP_SYS_ADMIN was the creation of CAP_SYSLOG in 2010. Once something becomes part of CAP_SYS_ADMIN, it seems, it stays there. Why might that be the case?
One reason, of course, is the aforementioned compatibility issue: once CAP_SYS_ADMIN allows an action, it can never lose that power without possibly breaking existing systems. When Serge Hallyn added CAP_SYSLOG, he added the usual code that made things continue to work if the process in question had CAP_SYS_ADMIN. In that case, though, the kernel issues a warning that use of CAP_SYS_ADMIN for these operations is deprecated. Nearly ten years later, the compatibility code — and the warning — remain. Splitting capabilities out of CAP_SYS_ADMIN is less than fully rewarding when the power of CAP_SYS_ADMIN itself can never be reduced.
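That compatibility pattern looks roughly like the sketch below, simplified from the kernel's syslog permission check; the function name here is illustrative:

    /* Simplified sketch of the CAP_SYSLOG transition: the old
     * capability still works, but its use draws a one-time warning. */
    static int check_syslog_permissions_sketch(void)
    {
            if (capable(CAP_SYSLOG))
                    return 0;
            if (capable(CAP_SYS_ADMIN)) {
                    pr_warn_once("Attempt to access syslog with CAP_SYS_ADMIN but no CAP_SYSLOG (deprecated).\n");
                    return 0;
            }
            return -EPERM;
    }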
Adding capabilities has hazards of its own, in that existing code will know nothing about a new capability and what it might control. A program that clears bits out of a capability mask is likely to clear the new one, but that capability might be needed going forward. Experience has shown that running a privileged program with selectively removed capabilities can open up surprising vulnerabilities; every new capability potentially creates just that sort of situation. So capabilities must be added with care. There is a reason why the SELinux build has a check that explicitly fails if new capabilities have been added without corresponding changes in SELinux itself.
Then, there is the unfortunate fact that capabilities in Linux are seen by many as a failed experiment. Nobody has ever made a practical, fully capability-based system using them, and many of the defined capabilities are relatively easily escalated to full root powers. Linux systems above the kernel level have made limited use of them, if indeed capabilities have been used at all. It can be hard to generate enthusiasm for refining a system that can never work as was originally intended and which may never be used in any serious way.
As an example, one obvious way to use capabilities to reduce privilege would be to remove the setuid bit on existing utilities and install just the needed capabilities instead. The kernel has supported file-based capabilities since the 2.6.24 release in 2008, after all. Your editor's current system, running Fedora 31 (a distribution that counts "First" among its foundations), contains a grand total of nine binaries with capabilities attached:
# getcap -r /
/usr/bin/gnome-keyring-daemon = cap_ipc_lock+ep
/usr/bin/clockdiff = cap_net_raw+p
/usr/bin/arping = cap_net_raw+p
/usr/bin/newuidmap = cap_setuid+ep
/usr/bin/newgidmap = cap_setgid+ep
/usr/bin/ping = cap_net_admin,cap_net_raw+p
/usr/bin/gnome-shell = cap_sys_nice+ep
/usr/sbin/mtr-packet = cap_net_raw+ep
/usr/sbin/suexec = cap_setgid,cap_setuid+ep
It is good to know that gnome-shell does not run setuid root, so capabilities have brought some value here. But that compares with 31 setuid root binaries; it would appear that there is no prospect of this distribution becoming capability-only anytime soon.
That said, there are signs of a shift with regard to capabilities. The never-ending desire to harden our systems against attacks is driving developers to take another look at Linux capabilities and how they might help. The Android system makes use of capabilities, for example. Systemd gives administrators extensive control over the capabilities granted to running programs. It may just be that, after many years of disuse, Linux capabilities are finally finding a place in deployed systems.
If that is the case, we may well see a renewed level of interest in increasing the granularity of the permissions controlled by capabilities. That could include splitting more powers out of CAP_SYS_ADMIN, though, as noted above, that must be done carefully. CAP_SYS_ADMIN is unlikely to stop being the not-so-new root anytime soon, but perhaps it could be made into a capability that few programs need in order to get their work done.
watch_mount(), watch_sb(), and fsinfo() (again)
Filesystems, by design, hide a lot of complexity from users. At times, though, those users need to be able to look inside the black box and extract information about what is going on within a filesystem. Answering this need is David Howells, the creator of a number of filesystem-oriented system calls; in this patch set he tries to add three more, one of which we have seen before and two of which are new.
The new system calls, watch_mount() and watch_sb(), provide ways for a process to request notifications whenever something changes at either a mount point (watch_mount()) or within a specific mounted filesystem (watch_sb(), the "sb" standing for "superblock"). For a mount point, events of interest can include the mounting or unmounting of filesystems anywhere below the mount point, the change of an attribute like read-only, movement of mount points, and more. Filesystem-specific events can also include attribute changes, along with filesystem errors, quota problems, or network issues for remote filesystems.
These system calls are built on a newer version of the event-notification mechanism that Howells has been working on for some time. In the past, getting notifications has involved opening a new device (/dev/watch_queue), but that interface has changed in the meantime. In the current version, a process calls pipe2() with the new O_NOTIFICATION_PIPE flag to create a special type of pipe meant for notification use. The writable side of this pipe is not used by the application; the file descriptor for the readable end can be passed to either of the new system calls:
    int watch_mount(int dirfd, const char *path, unsigned int flags,
                    int watch_fd, int watch_id);
    int watch_sb(int dirfd, const char *path, unsigned int flags,
                 int watch_fd, int watch_id);
In both cases, dirfd, path, and flags identify the directory of interest in the usual openat() style. The notification pipe is passed in as watch_fd, and watch_id is an integer value that will be returned in any generated events. There is a special case, though; if watch_id is -1, any existing watch using the given watch_fd will be removed.
The application receives events by reading from the pipe. By default all events affecting the given watch point will be returned. The application can, though, create a filter that is attached to the notification pipe with an ioctl() call. There's another ioctl() call to set the size of the buffer used to hold notifications sent to user space. Curious readers can see these system calls used in this sample program.
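Putting the pieces together, a caller might look something like the sketch below. Note that this is illustrative only: the proposed system calls have no glibc wrappers (a real program would go through syscall() with whatever numbers the patches assign), and the parsing of notification records is omitted.

    /* Sketch: watch a mounted filesystem for superblock events. */
    #define _GNU_SOURCE
    #include <unistd.h>
    #include <fcntl.h>
    #include <linux/watch_queue.h>  /* O_NOTIFICATION_PIPE, from the patches */

    /* Proposed system call; no wrapper exists yet. */
    extern int watch_sb(int dirfd, const char *path, unsigned int flags,
                        int watch_fd, int watch_id);

    int main(void)
    {
            int fds[2];
            char buf[4096];

            if (pipe2(fds, O_NOTIFICATION_PIPE) < 0)
                    return 1;
            /* Watch the filesystem mounted at /home; 0x01 is the
             * watch_id echoed back in each event. */
            if (watch_sb(AT_FDCWD, "/home", 0, fds[0], 0x01) < 0)
                    return 1;
            for (;;) {
                    ssize_t n = read(fds[0], buf, sizeof(buf));
                    if (n <= 0)
                            break;
                    /* buf now holds one or more packed notification
                     * records; parsing is omitted here. */
            }
            return 0;
    }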
Unlike the system calls described above, fsinfo() has been seen before. Its prototype remains the same:
    int fsinfo(int dirfd, const char *path, const struct fsinfo_params *params,
               void *buffer, size_t buf_size);
As before, dirfd and path describe the filesystem for which information is requested; there is no flags argument here, but it is hidden within the params structure, which looks like this:
    struct fsinfo_params {
            __u32 at_flags;
            __u32 flags;
            __u32 request;
            __u32 Nth;
            __u32 Mth;
            __u64 __reserved[3];
    };
The at_flags field contains the same flags that one would ordinarily expect to see in an openat()-style system call. The request field describes the information that is being asked for; a number of possible values can be found in this patch from the series. Potentially available information ("potentially" because filesystems are not required to implement every possibility) includes filesystem limits, timestamp resolution information, the volume UUID, the servers behind a remote filesystem, and more. For attributes that can have multiple values, the Nth and Mth fields can be used to select one in particular.
The format of the returned value is ... complex. Values are stored into the provided buffer in any of a number of formats, depending on what was requested. For some, a structure is returned; others return a string or a type called simply "opaque". There is some documentation in this patch, but it seems clear that potential users of this system call will have to do some digging to figure out the information that will be returned to them.
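A rough usage sketch follows; once again there is no glibc wrapper, and FSINFO_ATTR_VOLUME_UUID here is an illustrative stand-in for whatever attribute constant the final patches define:

    /* Sketch: ask for one attribute of the filesystem behind /home,
     * using the struct fsinfo_params layout shown above. */
    struct fsinfo_params params = {
            .at_flags = 0,
            .request  = FSINFO_ATTR_VOLUME_UUID,  /* illustrative constant */
            .Nth      = 0,
            .Mth      = 0,
    };
    char buffer[128];

    int n = fsinfo(AT_FDCWD, "/home", &params, buffer, sizeof(buffer));
    /* On success, buffer holds the requested value; its format (struct,
     * string, or opaque bytes) depends on the attribute requested. */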
This patch set is now in its 17th revision, having evolved quite a bit over the years. The one comment on this version, so far, comes from James Bottomley, who suggested that there may not be a need for fsinfo() at all. Instead, with some changes to how fsconfig() (which is used to configure filesystem attributes) is implemented, it could be turned into an interface that could both set and read attributes. So far, Howells has not responded to that suggestion.
Overall, the fact that these patches have been through 17 revisions (so far) says a lot. Nobody doubts that getting this information out of the kernel would be useful, but the API remains complex and hard for potential users to understand. Whether that can be fixed while retaining the features provided by these system calls is not clear, though.
A look at "BPF Performance Tools"
BPF has exploded within the Linux world over the last few years, growing from its networking roots into the go-to tool for running custom in-kernel programs. Its role seems to expand with every kernel release into diverse areas such as security and device control. But none of that is the focus of a relatively new book from Brendan Gregg, BPF Performance Tools; it looks, instead, at how BPF provides visibility into the guts of the kernel. Finding performance bottlenecks of various sorts on (generally large) production systems is an area where BPF and the tool set that has grown up around it can excel; Gregg's book describes that landscape in great depth.
The book is meant to be both a way to learn about what BPF can do to improve the observability of Linux systems and applications and a reference guide to a large body of tools that Gregg and others have built up to peer inside the running system. Interestingly, it does not actually cover the underlying BPF virtual-machine instructions all that much, except in an appendix; the focus is on how to use BPF at a higher level. Even then, learning to actually write tools using the high-level environments (BCC and bpftrace) is not truly the intent either, though code samples for bpftrace abound. The book is definitely geared toward finding problems at multiple levels on Linux systems running in production.
It begins by introducing BPF, noting its origin as the Berkeley Packet Filter and its eventual upgrade to extended BPF (eBPF), before giving a quick overview of the tracing and sampling techniques available on Linux. It then gives a taste of what the BPF Compiler Collection (better known as BCC) can actually do by using canned tools to examine system-wide execve() calls and block I/O latency. The different levels of tracing available in a Linux system, from applications through system libraries and the system-call interface down to internal kernel tracepoints and hardware counters, are briefly described with an eye toward a few bpftrace "one-liners" to examine open() and openat() system calls. Examples of bpftrace one-liners (and more) can be found in Gregg's LWN article from July 2019 and a report on his talk at the 2019 Linux Storage, Filesystem, and Memory-Management Summit.
That first chapter would be useful to anyone who is curious about what all the BPF fuss is about. The concepts it introduces (and more) are spelled out in greater detail in the rest of Part I ("Technologies"). The book is meant to be read straight through, if desired, or simply used as a reference for the tools and techniques that can be used to track down problems in a system. That leads to a bit of repetitiveness here and there, so that readers popping into a particular place will not be completely lost. It can be a bit irritating at times for those just reading through, but it is probably unavoidable in a dual-purpose book like this.
BPF itself is a complicated beast, which hooks into a wide variety of facilities for gathering tracing information. That includes both static options (kernel tracepoints and user-level statically defined tracing (USDT) markers) and ways to insert dynamic instrumentation into the kernel (kprobes) or user-space programs (uprobes). BPF programs can be used to collect information from those sources (and others, like hardware performance monitoring counters (PMCs) and perf_events), summarize it in-kernel, and display the results in a variety of forms. Chapter 2 describes all of them in some detail.
One of the key advantages of BPF over other tracing techniques is that it does its work efficiently in the kernel and can simply present its results; many other tools require storing lots of information in memory or log files and then post-processing it to actually pull out the data of interest. Some also require adding code to the kernel, either by rebuilding it with a different configuration or by adding a kernel module; BPF dispenses with all of that. In addition, BPF has data structures and helper functions to collect the kinds of information that might be of interest (e.g. stack traces); descriptions of all of that are gathered up in Chapter 2 as well.
While using BPF is the focus of the book, Gregg does not ignore the other tools available for diagnosing problems. The chapter on the process of analyzing a system starts with a look at the goals and methodologies that can be used to narrow things down. There are two separate checklists that are presented as starting points. The first uses standard Linux tools (e.g. vmstat, pidstat, and sar) in a "60-second checklist", before moving into a checklist of BCC tools (e.g. execsnoop, biosnoop, and tcpaccept). Each of the entries on the checklists is described along with how the output can be useful in pinpointing where problems might be; the BCC tool descriptions also reference other parts of the book where they are described in even more detail.
Rounding out Part I are chapters on BCC and bpftrace, covering their installation, internal operation, and use; each chapter has multiple examples of the tools in action. These days, many Linux distributions provide packages for both of these interfaces, including the tools developed using them. There is also a large set of tools that Gregg developed specifically for the book; in his diagram of the BPF tool set, those new tools appear in red, while the existing tools are shown in black. All of the new tools can be found in his GitHub repository.
While the first part of the book gives a lot of useful context and a large, tasty bite of what BPF can do, the meat of the book is contained in Part II ("Using BPF Tools"). There are 11 separate chapters, each looking at a different area of the system with an eye toward how to use the tools and bpftrace one-liners to dig into the operation of that area. For example, there are chapters covering the CPU, memory, I/O, networking, security, applications, languages (e.g. Java), containers, and hypervisors.
Each chapter gives some background information to help understand the role of the area covered in the chapter; it also describes aspects of it that might lead to performance or other problems. The traditional tools for investigating problems are introduced with examples given of the kinds of information they can provide. The chapters then move into BPF tools and bpftrace programs (or one-liners) that can be used for troubleshooting and pinpointing problem areas. Many of the chapters have an "Optional Exercises" section with ideas for ways to extend the existing tools or write new ones either using BCC or bpftrace; the ones marked "advanced, unsolved" are, of course, particularly challenging.
The remaining parts of the book are supplemental material at some level. Part III ("Additional Topics") has a chapter on other BPF-based performance-analysis tools and one on "Tips, Tricks, and Common Problems". The final part is appendixes, including a list of all the one-liners used in Part II, a bpftrace cheat sheet, information on developing BCC-based and C-based BPF programs, and a reference on the BPF instruction set. That is followed by a glossary, bibliography, and an index.
I have a couple of nits to pick, but overall this is an excellent book, with comprehensive coverage of BPF-based tools and how to use them for investigating performance and other problems. It can be a bit overwhelming at times, but that is really due to the subject matter at hand; there are lots of parts and pieces in the BPF landscape, so trying to keep them all straight can be a challenge.
I got a review copy of the EPUB version of the book from the publisher, which I read in two different ways: on a tablet using Lithium and on my desktop with calibre; I did not try it on my Kobo E Ink reader, as the layout of the book did not seem conducive to a small, monochrome screen. I encountered some rendering problems in Lithium, which I have used successfully with other technically oriented books: for examples and tool output that spanned page boundaries on the screen, the portion that should have appeared on the next page was not displayed. There were, however, links that would take you to a full-page rendering of the item, which could then be tapped to return to the right place. Calibre did not have that flaw, and presumably other EPUB readers would not either, but the problem was obviously not annoying enough for me to go search out another reader.
The book has quite a number of in-line footnotes, which are useful; they often highlight the history of, and the developer behind, a particular tool. But the use of square-bracket-style links in the text left something to be desired. Clicking (or tapping) those would take you to an entry in the list after the bibliography, but each listed item was itself simply a link to a web URL. Some way to go directly to the linked-to item would have made navigation easier. Obviously, a dead-tree version of the book would not suffer from that problem, but paging to the list might be a bit painful there as well. Perhaps newer editions could simply use regular footnotes for the links, making them directly selectable in electronic copies and saving the paging on paper copies.
While the book focuses on performance problems on "big iron"—many of the examples show output from 48-CPU systems—the techniques and tools will be useful for a wide variety of other environments. Tracking down bugs on a desktop system or gaining familiarity with the internals of the kernel are just two of the possibilities that the book helps unlock. Nearly anyone running Linux will find a bpftrace one-liner (or three) that will pique their curiosity. BPF Performance Tools is definitely worth a look for anyone curious about the workings of their Linux systems.
Page editor: Jonathan Corbet