Kernel development

Brief items

Kernel release status

The current development kernel is 4.4-rc1, released on November 15. Linus said: "Just looking at the patch itself, things look fairly normal at a high level, possibly a bit more driver-heavy than usual with about 75% of the patch being drivers, and 10% being architecture updates. The remaining 15% is documentation, filesystem, core networking (as opposed to network drivers), tooling and some core infrastructure."

Stable updates: none have been released in the last week.

A report from the Seoul media workshop

A detailed report has been posted from the Kernel Summit Media Workshop, held October 26 in Seoul. "We had 21 attendees from various companies and places in the world gather to discuss the current state of Linux Media and the challenges that need to be overcome to push these technologies into the future. This article will cover the major topics that were discussed during this workshop and the decisions that were made about the direction of this community."

4.4 Merge window, part 2

By Jonathan Corbet
November 18, 2015
By the time that the first 4.4 merge-window article was written, most of the action was done; a mere 700 non-merge changesets were pulled into the mainline repository between then and the release of 4.4-rc1 on November 15. Many of those changes were fixes that wandered in during the merge window itself, leaving just a couple of changes worthy of note here:
  • There is a new "devfreq cooling" mechanism for the thermal management of devices. On properly equipped hardware, this framework can put an overheating device into a lower-power mode to keep its operating temperature within bounds.

  • The pulse-width modulator (PWM) tree was pulled, adding support for Renesas R-Car PWM controllers, Marvell Berlin PWM controllers, Broadcom BCM7038 PWM controllers, and MediaTek display PWM controllers.

All told, 11,528 changesets found their way into the mainline during the 4.4 merge window. That makes 4.4 a busy development cycle relative to its immediate predecessors; among the 4.x releases, only 4.2 had a larger merge window:

Merge-window changes

    Release    Changes
    4.0          8,950
    4.1         10,659
    4.2         12,092
    4.3         10,756
    4.4         11,528

If the usual 63-day schedule holds, we can expect the final 4.4 release to happen on January 3, 2016, though it is always possible that the holiday season might slow things down a bit. Even if we don't get 4.4 as a new year's present, it should show up soon thereafter.
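
The projected date is simple calendar arithmetic. As a minimal sketch, assuming that 4.3 came out on November 1, 2015 (a date not given above) and using a hypothetical helper named add_days(), the C library's mktime() can do the month rollover:

```c
#include <time.h>

/*
 * Add `days` to a calendar date and return the normalized result.
 * mktime() accepts out-of-range fields and normalizes them, so a
 * day-of-month of 64 in November rolls over into later months.
 * add_days() is a hypothetical helper written for this illustration.
 */
static struct tm add_days(int year, int month, int day, int days)
{
    struct tm t = {0};

    t.tm_year = year - 1900;   /* struct tm counts years from 1900 */
    t.tm_mon = month - 1;      /* ...and months from zero */
    t.tm_mday = day + days;
    t.tm_hour = 12;            /* midday keeps clear of DST transitions */
    t.tm_isdst = -1;           /* let the library work out DST */
    mktime(&t);                /* normalizes the fields in place */
    return t;
}
```

Under that assumption, add_days(2015, 11, 1, 63) comes back as January 3, 2016, which falls on a Sunday.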

Kernel development news

A return to restartable sequences

By Jonathan Corbet
November 18, 2015
Once upon a time, highly concurrent programming was something that only a small subset of kernel developers needed to worry about. As the number of cores found in a CPU package grows, though, concurrency concerns are moving out to user space. The concerns of interest here are not just the protection of critical sections; user space has had to be able to do that for many years. A different level of worry comes to the fore at levels of concurrency where the overhead of locking becomes a significant performance issue in its own right. That's when developers start to think about lockless algorithms, which bring unique challenges in user space.

In the kernel, lockless programming tends to be tricky, leading to code that can be brittle if the data-access rules are not well understood or observed. But kernel code has a distinct advantage over user-space code in this regard: it is able to ensure that critical-section code can run to completion without being preempted. As long as the code restricts itself to per-CPU data structures, running with preemption disabled guarantees that no other thread will try to access those structures concurrently. User-space code has no such luxury; it always runs with preemption enabled. So any attempt to use per-CPU data structures in a lockless mode must use a different approach.

One such approach has been termed "restartable sequences"; the first patch enabling restartable sequences was examined here last July. A new patch set was posted alongside the kernel-summit session on restartable sequences in October. This patch features a different implementation and API that should address a number of the worries raised by the first attempt.

A restartable sequence is a brief segment of code performing some sort of lockless operation on a per-CPU data structure. A key rule is that the visible effects of a restartable sequence must be made by a single instruction at the very end of the sequence. Imagine, for example, the following (simplistic) code removing the head item from a linked list:

    struct list_thingie *item, *new_head;

    item = percpu_thingie_list_head;        /* read the current head */
    new_head = item->next;                  /* find its successor */
    percpu_thingie_list_head = new_head;    /* commit: the only visible store */

The final line is the only operation that would be visible to other threads running on the same CPU. The operation could be interrupted anytime before that assignment without any ill effects — assuming the interrupted thread did not actually try to use item, of course. That final line could also be implemented as a single instruction. So this little fragment of code could meet the rules for a restartable sequence; properly implemented, it could allow multiple threads to remove items from a shared list without the need for locking.

There is one other thing that is needed, though, for a proper restartable sequence: some code to execute if the sequence happens to be interrupted partway through. In most cases (this one included), that code simply needs to restart the sequence from the beginning. With that in place, a restartable sequence can safely run in a lockless mode, but only if it can either (1) run to completion, or (2) know that it has been interrupted and jump to the failure code. That is where the need for kernel support comes in.

In the new patch, an application wanting to use restartable sequences needs to register two addresses with the kernel, using a new system call:

    int restartable_sequences(int flags, unsigned long *counter, void *post_commit);

(restartable_sequences() is the name used in the implementation; the associated test code calls it rseq(), though.)

The flags argument is currently unused. The counter address is a location where the kernel stores a combination of the current CPU number and the current "event counter" — the number of times the thread has been preempted. The application should initially store NULL in the location pointed to by post_commit; later, when a restartable sequence is active, the application should store the address of the first instruction following the commit instruction there. Note that this call does not actually start a restartable sequence; instead, it sets up the infrastructure so that such sequences can be run.

To actually run a restartable sequence, the application thread must carefully do the following things, in order:

  1. Read the current CPU/event counter value from the counter address provided to the kernel above; this value must be stored for future use.

  2. Place the address to jump to should the sequence fail (i.e. if the thread is preempted while the sequence is running) where the kernel will find it. The actual location for this address is architecture-dependent; the x86_64 implementation wants it in the RCX processor register.

  3. Load the address of the first post-commit instruction into the post_commit address provided above.

  4. Check the counter value again and ensure that it matches the value stored in the first step. If the two do not match, preemption has already occurred and the code should jump directly to the failure address.

  5. Execute the critical section through the final commit instruction.

  6. Clear the post-commit instruction address stored in step 3.

The kernel's test for whether a restartable sequence is active is simple: is the current instruction pointer less than the address of the post-commit instruction stored in step 3? It is thus the storing of that address that begins the sequence for real; once that happens, the kernel will cause the thread to jump to its failure address if it is preempted. The manual check in step 4 is needed, though, in case preemption happened just before the execution of step 3. Performing the steps in this order ensures that there are no race conditions around the preemption checking.
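
The ordering of those six steps can be illustrated with a user-space simulation. This is only a sketch: it makes no system call, and sim_counter, sim_post_commit, sim_preempt(), and pop_head() are hypothetical stand-ins for state that the kernel would manage. It models just the manual check in step 4; in a real implementation the instructions around the commit would be written in assembly, and the kernel itself would redirect a thread preempted during step 5 to its failure address.

```c
#include <stddef.h>

struct item {
    struct item *next;
    int value;
};

/* Hypothetical stand-ins for kernel-managed state (one CPU only). */
static unsigned long sim_counter;     /* CPU/event counter value */
static void *sim_post_commit;         /* non-NULL while a sequence is active */
static struct item *percpu_list_head; /* the per-CPU data being protected */
static int sim_restarts;              /* how often pop_head() had to restart */

/* Pretend that the kernel preempted the thread. */
static void sim_preempt(void)
{
    sim_counter++;
}

/*
 * Remove and return the head of the list, restarting whenever a
 * simulated preemption is detected. `inject` is the number of
 * preemptions to simulate just before step 3.
 */
static struct item *pop_head(int inject)
{
    struct item *item;

    for (;;) {
        unsigned long start = sim_counter;      /* step 1: sample the counter */

        /* step 2: the failure target is the top of this loop */

        if (inject > 0) {                       /* preemption just before step 3 */
            inject--;
            sim_preempt();
        }
        sim_post_commit = &sim_post_commit;     /* step 3: any non-NULL marker */

        if (sim_counter != start) {             /* step 4: manual recheck */
            sim_post_commit = NULL;
            sim_restarts++;
            continue;                           /* jump to the failure path */
        }

        item = percpu_list_head;                /* step 5: critical section */
        if (item)
            percpu_list_head = item->next;      /* the single commit store */

        sim_post_commit = NULL;                 /* step 6: sequence finished */
        return item;
    }
}
```

With a two-item list and two injected preemptions, pop_head() restarts twice and then removes the first item normally, leaving the second item as the new head.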

In the previous version of the patch, the entire restartable sequence almost certainly needed to be written in assembly. This new interface does not eliminate the need for assembly code, but it does reduce the amount of that code considerably. A few instructions around and including the final commit must still be done in assembly, though that can probably be hidden in library code for a number of common use cases.

The previous version of the patch required the registration of one memory area that would hold critical-section code. With this version, instead, the critical section(s) can appear anywhere. Library code could almost use this feature independently of other application code, with one exception: the two addresses passed to the restartable_sequences() call must be shared by all users. The alternative would be to make a new restartable_sequences() call prior to beginning each sequence, but that is likely to run fairly strongly counter to the performance objectives that motivated the use of restartable sequences in the first place.

Discussion of this version of the patch set has been muted; perhaps it got lost in all the other kernel summit activity. Interest in this feature clearly goes beyond Google (where the patch originates), though. One would thus expect this feature to eventually enter the kernel in some form. Some of the implementation concerns from last time around have been addressed; the impact on the scheduler has been reduced, for example. Whether it will take another iteration or two to get the user-space interface right remains to be seen, though.

Persistent BPF objects

By Jonathan Corbet
November 18, 2015
With the addition of the bpf() system call in the 3.18 development cycle, user space gained the ability to load extended BPF programs into the kernel and to share data areas (called "maps") with them. The 4.4 kernel will take things further by making it possible for unprivileged processes to perform BPF operations. As interest in using BPF increases, though, some of the limitations of the initial design are starting to show through; one of those is the inability to create BPF objects (programs or maps) that outlive the process that creates them. That particular shortcoming will be addressed by another patch set, also merged for 4.4.

The original thinking behind the lifecycle of BPF objects was that they would be created and used by a single process. Current uses, though, are stretching that model. The network traffic-control subsystem, for example, may want to attach both classification and dispatching BPF programs to a traffic policy; that policy should then live on after the invocation of tc that created it has exited. Tracing applications, too, may involve setting up programs and maps that should persist for a while.

In pre-4.4 kernels, the only way to make these objects persist is to ensure that some process keeps the file descriptor open. One can create a special daemon that functions as a shelf for file descriptors, then pass BPF objects to it over Unix-domain sockets, but this solution lacks elegance and could be difficult to secure. If there is a true use case for persistent BPF objects, the kernel probably should support them directly.
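
For reference, handing a file descriptor to another process over a Unix-domain socket uses SCM_RIGHTS ancillary data. The following is a minimal sketch of that mechanism only; send_fd() and recv_fd() are hypothetical helper names, and a real "shelf" daemon would also need naming, lookup, and access control on top of it.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send `fd` over the Unix-domain socket `sock`; 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 'x';   /* one byte of ordinary data carries the message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;          /* payload is file descriptors */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`; returns the new fd or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;   /* a new descriptor referring to the same open file */
}
```

The received descriptor is a new number in the receiving process that refers to the same underlying object, which is why the object stays alive for as long as the daemon holds it open.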

One can say that, however, without answering the question of just how the persistence mechanism should work. In this case, it seems that the BPF developers considered just about every possible option. One could use a special FUSE filesystem to hold the file descriptors, but that really looks like a variant on the dedicated daemon idea. One could create a special namespace, the way network sockets or System V IPC objects are handled, but the interface is awkward and inaccessible to shell scripts. For a while the developers even played with the idea of creating special devices for persistent BPF objects, but that idea was dropped over concerns about memory use and an inability to play well with namespaces.

So what we have instead is yet another special kernel virtual filesystem. This one is meant to be mounted at /sys/fs/bpf. It is a singleton filesystem, meaning that it can be mounted multiple times within a single namespace and every mount will see the same directory tree. Each mount namespace will, however, get its own version of this filesystem. Within /sys/fs/bpf, a suitably privileged user can create and remove directories in the usual ways to set up a suitable directory hierarchy.

The "files" in this hierarchy, which represent persistent BPF objects, must be managed with the bpf() system call, though. The new BPF_OBJ_PIN bpf() command can be used to "pin" a file descriptor into the BPF filesystem; it takes a file descriptor corresponding to a BPF object and a path name as arguments. Once the BPF_OBJ_PIN call has succeeded, the associated BPF object will be made persistent and visible in the filesystem at the given path name. To unpin an object, ending its persistence, one need only remove the associated file in the usual way.

To access the persistent object, one must use another new bpf() command called BPF_OBJ_GET. It functions much like an open() call, in that it takes a path name and returns a file descriptor corresponding to that path. That file descriptor may then be used with other bpf() operations as needed.

Given that BPF_OBJ_GET looks like open(), one might well wonder why programs can't simply call open() instead. This was, evidently, a deliberate design decision; according to Alexei Starovoitov:

We've considered letting open() of the file return bpf specific anon-inode, but decided to reserve that for other more natural file operations. Therefore BPF_NEW_FD is needed.

(The BPF_NEW_FD command was present in an earlier version of the patch, but is not part of what was merged into 4.4.)
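
The pin and get operations might be wrapped as follows. This is a hedged sketch rather than code from the patch set: it assumes kernel headers from 4.4 or later, where linux/bpf.h spells the two commands BPF_OBJ_PIN and BPF_OBJ_GET, and the wrapper names sys_bpf(), bpf_pin_object(), and bpf_get_object() are hypothetical. The C library offers no bpf() function, so the raw system call is used.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw bpf() system call. */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Pin the BPF object behind `fd` at `path`, which is expected to be
 * inside a mounted BPF filesystem. Returns 0, or -1 with errno set.
 */
static int bpf_pin_object(int fd, const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));     /* unused fields must be zero */
    attr.pathname = (uint64_t)(uintptr_t)path;
    attr.bpf_fd = (uint32_t)fd;
    return sys_bpf(BPF_OBJ_PIN, &attr);
}

/*
 * Obtain a new file descriptor for the object pinned at `path`.
 * Returns the descriptor, or -1 with errno set.
 */
static int bpf_get_object(const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    return sys_bpf(BPF_OBJ_GET, &attr);
}
```

A tool could call bpf_pin_object() on a map descriptor just before exiting, and a later invocation could recover the map with bpf_get_object() on the same path. Both calls fail with -1 when the path does not exist or the caller lacks the needed privilege.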

The nature of these "more natural" operations was not laid out. There has been some discussion, though, of exposing BPF maps directly in the filesystem namespace. A map is essentially a key/value store, so one could consider representing it as a directory, with each key showing up as a "file" within it. The true value of this feature is not entirely clear, and it could get awkward when one considers that keys can be arbitrary binary data; they need not follow the rules that apply to file names. So it's perhaps not surprising that this feature is not present in the current patch set.
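
The awkwardness with binary keys is easy to demonstrate. The sketch below uses a hypothetical helper, key_usable_as_filename(), and assumes the usual Linux rules for a file-name component: 1 to 255 bytes, no NUL or slash bytes, and not "." or "..". It is an illustration of the problem, not part of any proposed interface.

```c
#include <stdbool.h>
#include <stddef.h>

#define NAME_MAX_BYTES 255   /* assumed limit on one file-name component */

/* Report whether a raw map key could serve directly as a file name. */
static bool key_usable_as_filename(const unsigned char *key, size_t len)
{
    size_t i;

    if (len == 0 || len > NAME_MAX_BYTES)
        return false;

    /* "." and ".." already have a meaning in every directory. */
    if (len == 1 && key[0] == '.')
        return false;
    if (len == 2 && key[0] == '.' && key[1] == '.')
        return false;

    /* NUL ends a C string and '/' separates path components. */
    for (i = 0; i < len; i++)
        if (key[i] == '\0' || key[i] == '/')
            return false;

    return true;
}
```

A textual key such as "eth0" passes, but a four-byte integer key with the value 1 contains NUL bytes and fails, so any such interface would have to encode keys in some way.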

For the curious, the developers included an example program under samples/bpf. Now it is up to distributors to decide whether they want to mount /sys/fs/bpf by default, and up to application developers to make use of this new capability.

Patches and updates

Kernel trees

Linus Torvalds: Linux 4.4-rc1
Sebastian Andrzej Siewior: 4.1.13-rt15
Kamal Mostafa: Linux 3.19.8-ckt10
Steven Rostedt: 3.18.24-rt22
Luis Henriques: Linux 3.16.7-ckt20
Steven Rostedt: 3.14.57-rt58
Steven Rostedt: 3.12.50-rt68
Steven Rostedt: 3.10.93-rt101
Steven Rostedt: 3.4.110-rt138
Ben Hutchings: Linux 3.2.73
Steven Rostedt: 3.2.72-rt105

Miscellaneous

Lucas De Marchi: kmod 22

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds