
Kernel development

Brief items

Kernel release status

The current development kernel is 4.6-rc3, released on April 10. "The biggest single patch is the resurrection of the olpc_dcon staging driver, that wasn't so obsolete after all. There was a missed opportunity there, since the resurrection of that driver missed Easter by a week. We'll do better in the comedic timing department next time, I promise."

Stable updates: 4.5.1, 4.4.7, and 3.14.66 were released on April 12.

Comments (none posted)

Quotes of the week

This is a disease among people who have been taught computer science. People think that "designing with extensions in mind" is a good idea.

It's a _horrible_ idea.

If you think that "design with extensions in mind" is a good idea, you're basically saying "I don't know what I might want to do".

Linus Torvalds

Our recent experience with the Linux scheduler revealed that the pressure to work around the challenging properties of modern hardware, such as non-uniform memory access latencies (NUMA), high costs of cache coherency and synchronization, and diverging CPU and memory latencies, resulted in a scheduler with an incredibly complex implementation. As a result, the very basic function of the scheduler, which is to make sure that runnable threads use idle cores, fell through the cracks.
Jean-Pierre Lozi et al. [PDF]

Comments (16 posted)

The linux-stable security tree project

Sasha Levin has announced the creation of the "linux-stable security tree" project. The idea is to take the current stable updates and filter out everything that isn't identified as a security fix. "Quite a few users of the stable trees pointed out that on complex deployments, where validation is non-trivial, there is little incentive to follow the stable tree after the product has been deployed to production. There is no interest in 'random' kernel fixes and the only requirements are to keep up with security vulnerabilities."

Full Story (comments: 13)

Kernel development news

Tracepoints with BPF

By Jonathan Corbet
April 13, 2016
One of the attractive features of tracing tools like SystemTap or DTrace is the ability to load code into the kernel to perform first-level analysis on the trace data stream. Tracing can produce vast amounts of data, but that data can often be reduced considerably by some simple processing — incrementing histogram buckets, for example. Current kernels have a wealth of tracepoints, but they lack the ability to perform arbitrary processing of trace events in kernel space before exporting the result. It would appear, though, that this situation will change as the result of a set of patches targeted for the 4.7 release.

It should come as no surprise to regular LWN readers at this point that the technology being used for the loading of code into the kernel is the BPF virtual machine. BPF allows code to be executed in kernel space under tight constraints; among other things, it can only access data that is explicitly provided to it and it cannot contain loops; thus, it is guaranteed to run within a bounded time. BPF code can also be translated to native code with the in-kernel just-in-time compiler, making it fast to run. This combination of attributes has helped BPF to move beyond the networking stack and make inroads into a number of kernel subsystems.

Every BPF program loaded into the kernel has a specific type assigned to it; that type restricts the places where the program may be run. The patch set from Alexei Starovoitov creates a new type (BPF_PROG_TYPE_TRACEPOINT) for programs intended to be attached to tracepoints. Those programs can then be loaded into the kernel with the bpf() system call. Actually attaching a program to a tracepoint is done by reading the tracepoint's ID from its file (in debugfs or tracefs), passing that ID to perf_event_open(), then using the PERF_EVENT_IOC_SET_BPF ioctl() command on the resulting file descriptor. That command exists in current kernels to allow BPF programs to be attached to kprobes; the patch set extends it to do the right thing depending on the type of BPF program passed to it.
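In user-space code, that sequence looks roughly like the minimal sketch below. It is an illustration rather than code from the patch set: it assumes a hypothetical prog_fd referring to a BPF_PROG_TYPE_TRACEPOINT program that has already been loaded with bpf(), and error handling has been pared down to the bare minimum.

    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Attach an already-loaded BPF_PROG_TYPE_TRACEPOINT program (prog_fd)
       to sched/sched_switch; returns the perf event fd, or -1 on error. */
    int attach_to_sched_switch(int prog_fd)
    {
        struct perf_event_attr attr = { 0 };
        char buf[32];
        ssize_t len;
        int id_fd, event_fd;

        /* Step 1: read the tracepoint ID from debugfs/tracefs */
        id_fd = open("/sys/kernel/debug/tracing/events/sched/sched_switch/id",
                     O_RDONLY);
        if (id_fd < 0)
            return -1;
        len = read(id_fd, buf, sizeof(buf) - 1);
        close(id_fd);
        if (len <= 0)
            return -1;
        buf[len] = '\0';

        /* Step 2: open a perf event for that tracepoint (CPU 0 only here;
           a real tool would open one event per CPU) */
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.size = sizeof(attr);
        attr.config = strtoull(buf, NULL, 0);
        attr.sample_period = 1;
        attr.wakeup_events = 1;
        event_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
                           0 /* cpu */, -1 /* group fd */, 0 /* flags */);
        if (event_fd < 0)
            return -1;

        /* Step 3: attach the BPF program and enable the event */
        if (ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) ||
            ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0)) {
            close(event_fd);
            return -1;
        }
        return event_fd;
    }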

When a tracepoint with a BPF program attached to it fires, that program will be run. The "context" area passed to the program is simply the tracepoint data as it would be passed to user space, except that the "common" fields are not accessible. As an example, the patch set includes a sample that attaches to the sched/sched_switch tracepoint, which fires when the scheduler switches execution from one process to another. The format file for that tracepoint (found in the tracepoint directory in debugfs or tracefs) provides the following data:

    field:unsigned short common_type;	offset:0;	size:2;	signed:0;
    field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
    field:unsigned char common_preempt_count;	offset:3;	size:1;signed:0;
    field:int common_pid;	offset:4;	size:4;	signed:1;

    field:char prev_comm[16];	offset:8;	size:16;	signed:1;
    field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
    field:int prev_prio;	offset:28;	size:4;	signed:1;
    field:long prev_state;	offset:32;	size:8;	signed:1;
    field:char next_comm[16];	offset:40;	size:16;	signed:1;
    field:pid_t next_pid;	offset:56;	size:4;	signed:1;
    field:int next_prio;	offset:60;	size:4;	signed:1;

Any program that accesses tracepoint data is expected to read this file to figure out which data is available and where it is to be found; failure to do so risks trouble in the future should the data associated with this tracepoint change. An in-kernel BPF program cannot read this file, so another solution must be found. That solution is for the developer to read the format file and turn it into a C structure; there is a tool (called tplist) that will do this job. The patch set contains the following structure, which was generated with tplist:

    /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
    struct sched_switch_args {
	unsigned long long pad;
	char prev_comm[16];
	int prev_pid;
	int prev_prio;
	long long prev_state;
	char next_comm[16];
	int next_pid;
	int next_prio;
    };

The pad field exists because the first four fields (those common to all tracepoints) are not accessible to BPF programs. The rest, however, can be accessed by name in a C program (which will be compiled to BPF and loaded into the kernel). This program will likely extract the data of interest from this structure, process it in its own special way, and store the result in a BPF map; user space can then access the result.

As with the other BPF program types, the helper code supplied with the kernel uses section names to determine what should be done with a specific program. A program meant to be attached to a tracepoint is placed in a section called "tracepoint/name", where "name" is the name of the tracepoint of interest; for the sample program, the section name is "tracepoint/sched/sched_switch".
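As a hedged illustration, a program written in the style of the kernel's samples/bpf directory might look like the sketch below. The SEC() macro, the bpf_helpers.h header, and the map-definition structure come from those samples; the argument structure is the tplist-generated one shown above. This example simply counts, per incoming process, how often the scheduler switches to it, leaving the totals in a hash map for user space to read.

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    /* tplist-generated layout from the format file shown earlier */
    struct sched_switch_args {
        unsigned long long pad;
        char prev_comm[16];
        int prev_pid;
        int prev_prio;
        long long prev_state;
        char next_comm[16];
        int next_pid;
        int next_prio;
    };

    /* hash map indexed by PID, readable from user space via bpf() */
    struct bpf_map_def SEC("maps") switch_count = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(int),
        .value_size = sizeof(long),
        .max_entries = 1024,
    };

    SEC("tracepoint/sched/sched_switch")
    int count_switches(struct sched_switch_args *ctx)
    {
        int key = ctx->next_pid;
        long init = 1, *count;

        count = bpf_map_lookup_elem(&switch_count, &key);
        if (count)
            __sync_fetch_and_add(count, 1);
        else
            bpf_map_update_elem(&switch_count, &key, &init, BPF_ANY);
        return 0;
    }

    char _license[] SEC("license") = "GPL";

The section name in the SEC() annotation is what tells the loader to attach this program to sched/sched_switch rather than to a kprobe.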

The mechanism works, and, importantly, a tracepoint-attached BPF program is quite a bit more efficient than placing a kprobe and attaching a program there. There are already tools in development (argdist, for example) that will create BPF programs for specific tasks; argdist will create a program to make a histogram of the values of a given tracepoint field. All told, it looks like a useful advance in the kernel's instrumentation.

There is a potential catch, though: the old issue of tracepoints and ABI stability. Tracepoints expose the inner workings of the kernel, which suggests that they must change if the kernel does. Changing tracepoints can, however, break applications that use them; this is an issue that has come up many times in the past. It is also why certain subsystems (the virtual filesystem layer, for example) do not allow tracepoints at all: the maintainers are worried that they might be unable to make important changes because they may break applications dependent on those tracepoints.

For user-space programs, the issue has been mitigated somewhat by providing library code to access tracepoint data. An application that uses such a library should be portable across multiple kernel versions. BPF programs, though, do not have access to such libraries and will break, perhaps silently, if the tracepoints they use are changed. ABI concerns have stalled the merging of this capability in the past, but there was little discussion of ABI worries this time around. Alexei maintains that the interface available to BPF programs is the same as that seen by user-space programs, so there should be no new ABI worries. Whether the BPF interface truly brings no new ABI issues is something that will have to be seen over the coming years.

And it does appear that we will have the chance to see how that plays out; David Miller has applied the patches to the net-next tree, meaning that they should reach the mainline in the 4.7 merge window. Users wanting more visibility into what's happening inside the kernel will likely be happy to have it.

Comments (5 posted)

Toward less-annoying background writeback

By Jonathan Corbet
April 13, 2016
It's an experience many of us have had: write a bunch of data to a relatively slow block device, then try to get some other work done. In many cases, the system will slow to a crawl or even appear to freeze for a while; things do not recover until the bulk of the data has been written to the device. On a system with a lot of memory and a slow I/O device, getting things back to a workable state can take a long time, sometimes measured in minutes. Linux users are understandably unimpressed by this behavior pattern, but it has been stubbornly present for a long time. Now, perhaps, a new patch set will improve the situation.

That patch set, from block subsystem maintainer Jens Axboe, is titled "Make background writeback not suck." "Background writeback" here refers to the act of flushing block data from memory to the underlying storage device. With normal Linux buffered I/O, a write() call simply transfers the data to memory; it's up to the memory-management subsystem to, via writeback, push that data to the device behind the scenes. Buffering writes in this manner enables a number of performance enhancements, including allowing multiple operations to be combined and enabling filesystems to improve layout locality on disk.

So how is it that a performance-enhancing technique occasionally leads to such terrible performance? Jens's diagnosis is that it has to do with the queuing of I/O requests in the block layer. When the memory-management code decides to write a range of dirty data, the result is an I/O request submitted to the block subsystem. That request may spend some time in the I/O scheduler, but it is eventually dispatched to the driver for the destination device. Getting there requires passing through a series of queues.

The problem is that, if there is a lot of dirty data to write, there may end up being vast numbers (as in thousands) of requests queued for the device. Even a reasonably fast drive can take some time to work through that many requests. If some other activity (clicking a link in a web browser, say, or launching an application) generates I/O requests on the same block device, those requests go to the back of that long queue and may not be serviced for some time. If multiple, synchronous requests are generated — page faults from a newly launched application, for example — each of those requests may, in turn, have to pass through this long queue. That is the point where things appear to just stop.

In other words, the block layer has a bufferbloat problem that mirrors the issues that have been seen in the networking stack. Lengthy queues lead to lengthy delays.

As with bufferbloat, the answer lies in finding a way to reduce the length of the queues. In the networking stack, techniques like byte queue limits and TCP small queues have mitigated much of the bufferbloat problem. Jens's patches attempt to do something similar in the block subsystem.

Taming the queues

Like networking, the block subsystem has queuing at multiple layers. Requests start in a submission queue and, perhaps after reordering or merging by an I/O scheduler, make their way to a dispatch queue for the target device. Most block drivers also maintain queues of their own internally. Those lower-level queues can be especially problematic since, by the time a request gets there, it is no longer subject to the I/O scheduler's control (if there is an I/O scheduler at all).

Jens's patch set aims to reduce the amount of data "in flight" through all of those queues by throttling requests when they are first submitted. To put it simply, each device has a maximum number of buffered-write requests that can be outstanding at any given time. If an incoming request would cause that limit to be exceeded, the process submitting the request will block until the length of the queue drops below the limit. That way, other requests will never be forced to wait for a long queue to drain before being acted upon.

In the real world, of course, things are not quite so simple. Writeback is not just important for ensuring that data makes it to persistent storage (though that is certainly important enough); it is also a key activity for the memory-management subsystem. Writeback is how dirty pages are made clean and, thus, available for reclaim and reuse; if writeback is impeded too much, the system could find itself in an out-of-memory situation. Running out of memory can lead to other user-disgruntling delays, along with unleashing the OOM killer. So any writeback throttling must be sure to not throttle things too much.

The patch set tries to avoid such unpleasantness by tracking the reason behind each buffered-write operation. If the memory-management subsystem is just pushing dirty pages out to disk as part of the regular task of making their contents persistent, the queue limit applies. If, instead, pages are being written to make them free for reclaim — if the system is running short of memory, in other words — the limit is increased. A higher limit also applies if a process is known to be waiting for writeback to complete (as might be the case for an fsync() call). On the other hand, if there have been any non-writeback requests within the last 100ms, the limit is reduced below the default for normal writeback requests.
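The shape of that policy may be easier to see in a small user-space model. This is purely illustrative: the names, numbers, and scaling factors below are invented for the sketch and are not taken from Jens's patches, which implement the throttling inside the block layer itself.

    #include <pthread.h>

    enum wb_reason { WB_BACKGROUND, WB_RECLAIM, WB_SYNC_WAIT };

    struct wb_throttle {
        pthread_mutex_t lock;
        pthread_cond_t  drained;
        unsigned int    inflight;        /* buffered writes queued for the device */
        unsigned int    base_limit;      /* limit for ordinary background writeback */
        int             reads_recently;  /* non-writeback I/O seen in the last 100ms? */
    };

    unsigned int wb_limit(struct wb_throttle *t, enum wb_reason why)
    {
        if (why == WB_RECLAIM || why == WB_SYNC_WAIT)
            return t->base_limit * 2;    /* don't starve reclaim or fsync() waiters */
        if (t->reads_recently)
            return t->base_limit / 2;    /* stay out of the way of other I/O */
        return t->base_limit;
    }

    /* called before queuing a buffered-write request; may block the submitter */
    void wb_wait_for_slot(struct wb_throttle *t, enum wb_reason why)
    {
        pthread_mutex_lock(&t->lock);
        while (t->inflight >= wb_limit(t, why))
            pthread_cond_wait(&t->drained, &t->lock);
        t->inflight++;
        pthread_mutex_unlock(&t->lock);
    }

    /* called when the device completes a write */
    void wb_request_done(struct wb_throttle *t)
    {
        pthread_mutex_lock(&t->lock);
        t->inflight--;
        pthread_cond_broadcast(&t->drained);
        pthread_mutex_unlock(&t->lock);
    }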

There is also a potential trap in the form of drives that do their own write caching. Such drives will indicate that a write request has completed once the data has been transferred, but that data may just be sitting in a cache within the drive itself. In other words, the drive, too, may be maintaining a long queue. In an attempt to avoid overfilling that queue, the block layer will impose a delay between write operations on drives that are known to do caching. That delay is 10ms by default, but can be tweaked via a sysfs knob.

Jens tested this work by having one process write 100MB to each of 50 files while another process tries to read a file. On current kernels, the reading process is penalized by having each successive read request placed at the end of the long queue created by all those write requests; as might be expected, it performs poorly. With the patches applied, the writing processes take a little longer to complete, but the reader runs much more quickly, with far fewer requests taking an inordinately long time.

This is an early-stage patch set; it is not expected to go upstream in the near future. Patches that change memory-management behavior can often cause unexpected problems with different workloads, so it takes a while to build confidence in a significant change, even after the development work is deemed to be complete (which is not the case here). Indeed, Dave Chinner has already reported a performance regression with one of his testing workloads. The tuning of the queue-size limits also needs to be made automatic if possible. There is clearly work still to be done here; the patch set is also likely to be a subject of discussion at the upcoming Linux Storage, Filesystem, and Memory-Management Summit. So users will have to wait a bit longer for this particular annoyance to be addressed.

Comments (36 posted)

Static code checks for the kernel

By Nathan Willis
April 13, 2016

ELC

At the 2016 Embedded Linux Conference in San Diego, Arnd Bergmann presented a session on what he called a "lighter topic," his recent efforts to catch and fix kernel bugs through static tests. Primarily, his method involved automating a large number of builds, first to catch compilation errors that caused build failures, then to catch compiler warning messages. He has done these builds for years, progressively fixing the errors and then the warnings for a range of kernel configurations.

There are two motives for this particular side project, he said: to help automate the testing of the many pull requests seen in the arm-soc tree (for which the sheer number of SoCs presents a logistical challenge), and to put significant code-refactoring work to the test. Previously, he explained, he had attempted to review every pull request in arm-soc and fix every regression, but that quickly proved too time-consuming to be done manually. Testing each patch automatically first reduced the time required. As for refactoring, he noted that he was a veteran of the big kernel lock removal days and was now helping out with the effort to implement year-2038 compliance. In both cases, the refactoring touched hundreds of separate drivers, which can mean a glut of regressions.

Broadly speaking, he said, there are two approaches to testing scores of builds. One can either record all known warnings and send an email whenever a new warning appears, or one can try to eliminate all known warnings. Bergmann has opted for the second approach, running a near-constant stream of kernel builds, and creating a patch for every compiler warning he sees. At present, he reported, there are about 500 such patches, most of them tiny. He is currently automating builds with a script he wrote that creates a random kernel configuration and attempts a build. He is averaging 50 builds a day, almost all for 32-bit ARM, with occasional forays into 64-bit ARM and, rarely, other architectures.

Getting to this current state has taken some time. In 2011, he began by fixing all of the failures produced by running make defconfig (that is, "default configuration") and make allmodconfig (that is, configuring as many symbols to "module" as possible) builds in the arm-soc tree. By 2012, those failures were eliminated, so he set out to eliminate all compiler warnings produced by defconfig builds. By 2013, those warnings had been eliminated, and he began running his build tests with make randconfig—which creates a randomized kernel configuration. In 2013, he had eliminated all randconfig failures, and turned to eliminating the allmodconfig warnings. He began chipping away at randconfig warnings in mid-2014. Although that process is not yet complete, he has also begun to run build tests using the Clang compiler instead of GCC, which, as one would expect, generates entirely different errors and warnings.

The most common bugs he discovers with randconfig builds are missing dependency statements, he said, which cause necessary parts of the kernel to not get built. In particular, he cited missing Netfilter dependencies and ALSA codec dependencies as common, although he also noted that x86 developers seem to forget that, at least on ARM, I2C can be configured as a module and thus needs to be listed as a dependency if it is needed. The ALSA problems suggest that we need a better way to express codec dependencies, he said, although he conceded that kernel configurations are confusing in plenty of ways. For example, he showed this patch he had written:

    --- a/net/openvswitch/Kconfig
    +++ b/net/openvswitch/Kconfig
    @@ -7,7 +7,9 @@ config OPENVSWITCH
      depends on INET
      depends on !NF_CONNTRACK || \
           (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
    -               (!NF_NAT || NF_NAT)))
    +               (!NF_NAT || NF_NAT) && \
    +               (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
    +               (!NF_NAT_IPV6 || NF_NAT_IPV6)))
      select LIBCRC32C
      select MPLS
      select NET_MPLS_GSO

and asked "what does it even mean for it to depend on NF_NAT or not NF_NAT?" The answer, he said, is that the test is being used to set an "is it a module or not" dependency for later usage, but it is hardly surprising that such syntax leads to bugs.

After "modules, modules, and more modules," Bergmann said, the next most common class of bugs he catches is uninitialized variables. He noted that Rusty Russell has written about how uninitialized variables are useful for error catching, but argued that they cause plenty of other errors. He showed a few examples, noting that often the flow of the code may mean that a reference to an uninitialized variable can never be reached, but he writes patches anyway to eliminate the warning. He also pointed out Steven Rostedt's patch to override if (for tracing purposes), saying it totally confused GCC, but that it helps to uncover quite a few bugs.

Next, Bergmann discussed some of the other code-checking tools available for kernel development, like scripts/checkpatch.pl and Sparse. Checkpatch looks for basic coding-style issues, he said, so while it is beneficial for submitting patches, it is not particularly valuable to run against existing code.

Sparse, however, makes use of annotations in the kernel source, so it can catch problems that GCC, with its lack of "domain-specific knowledge," simply cannot. Its big drawback is that it generates a lot of false positives. From the audience, Darren Hart noted that he uses Sparse regularly, but finds it problematic because it runs on complete files rather than on individual patches; it therefore tends to generate a lot of warnings that, upon inspection, were present in the original file rather than introduced by the patch. Mauro Carvalho Chehab replied that some subsystem maintainers, though far from all, have made an effort to remove all Sparse warnings in order to eliminate that particular problem.

Bergmann also said he makes use of some extra GCC warnings to catch additional bugs. Kernel builds can employ a sort of "graduated" warning level thanks to work by Michal Marek: the W=12 switch includes all warnings from W=1 and W=2; W=123 adds the W=3 warnings as well. Using make W=1 is generally useful, he said, while W=12 adds little of value amid a lot more noise, and W=123 is clear overkill, mostly due to an "explosion" of false positives. In the arm-soc tree, for instance, W=1 generates 631 instances of its most common warning, W=12 tops out at 94,235 for its top offender, and W=123 generates 782,719. The additional warnings of greatest interest to Bergmann include those for missing headers and missing prototypes. He also noted that he has recently run build tests with GCC 6, with promising results among the new warnings; so far, he has written 32 patches based on GCC 6 warnings, most of which have already been applied.

Bergmann touched briefly on his experiments looking for build errors and warnings with Clang. That effort requires support from the LLVMLinux project, of course, and at the moment the patch set necessary to even compile the kernel with Clang is broken for mainline. But, since January (when he started his experiments), he has found "tons of new warnings." He eliminated the build errors found with Clang on randconfig builds, but has not yet tackled writing patches for the warnings. Clang also has a built-in static analyzer, he noted, which can produce rather nice-looking output and for which you can write your own checks, but he has not yet had the time to work with it.

Moving a bit further afield, he mentioned the proprietary Coverity scanning tool, for which Dave Jones has done "some amazing work" to record and annotate the known findings (which is necessary because Coverity requires manual categorization of the bugs it finds). The downside from Bergmann's perspective, though, is that Coverity is x86-only. He also pointed the audience to Julia Lawall's Coccinelle, which can do sophisticated pattern matching. He has worked with it for his own static checking, he said, though he has found it "really slow." Consequently, it is not a tool he would use in his own work, though he admitted he may be doing something wrong.

Another tool Bergmann does not use regularly, but that he cited for its "surprisingly good" warnings, is Dan Carpenter's Smatch. Carpenter has used it to catch thousands of bugs, he said, and pointed audience members to Carpenter's recent blog post for further information. Next, Bergmann highlighted the convenience of the 0day build bot maintained by Fengguang Wu; in addition to monitoring public Git trees, it recently started testing patch submissions and generating patches. And, finally, he noted the kernelci.org build-and-boot testing infrastructure. The most interesting part of the project for Bergmann is that the service is ARM-centric and the build farm includes a wide variety of machines.

By that point in the session, time had run out, so there was not much opportunity for the audience to ask questions. Nevertheless, it was surely an informative look at how static code checking benefits the arm-soc tree, where the ever-expanding list of supported hardware makes for a daunting maintainer workload. Furthermore, as Bergmann pointed out more than once, there are benefits to squashing warnings in addition to compilation errors, regardless of what code one is testing.

[The author would like to thank the Linux Foundation for travel assistance to attend ELC 2016.]

Comments (1 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 4.6-rc3
Greg KH: Linux 4.5.1
Greg KH: Linux 4.4.7
Sebastian Andrzej Siewior: v4.4.6-rt14
Greg KH: Linux 3.14.66
Kamal Mostafa: Linux 3.13.11-ckt38

Architecture-specific

Build system

Core kernel code

Device drivers

Device driver infrastructure

Filesystems and block I/O

Memory management

Networking

Security-related

Miscellaneous

Karel Zak: util-linux v2.28

Page editor: Jonathan Corbet

