
Kernel development

Brief items

Kernel release status

The current development kernel is 4.2-rc1, released on July 5. As Linus explains, 4.2 may not, in the end, be the development cycle with the most commits ever, but there is still a lot going on. "However, if you count the size in pure number of lines changed, this really seems to be the biggest rc we've ever had, with over a million lines added (and about a quarter million removed). That beats the previous champion (3.11-rc1) that was huge mainly due to Lustre being added to the staging tree." The source of the biggest chunk of those new lines is the new amdgpu graphics driver.

Stable updates: 3.14.47 and 3.10.83 were released on July 6. The 4.1.2, 4.0.8, 3.14.48, and 3.10.84 updates are in the review process as of this writing; they can be expected on or after July 10.


Quotes of the week

I take it you have never seen the demonic glow in the eyes of a compiler implementer when thinking of all the code that can be broken^W^W^W^W^W optimizations that are enabled by relying on undefined behavior for signed integer overflow?
Paul McKenney

Eventually the real crazy talk will begin and Jens will have to admit that he *still* hasn't open sourced his online shopping site for kernel developer inspired action figures.
Chris Mason

Maintainers tend to get to be maintainers because they were good at something else, and not good enough at hiding from the "maintainer" role. There is a paradox here as a maintainer must be good at saying "No", but if they were they might never have agreed to become a maintainer.
Neil Brown


Kernel Summit 2015: Call for Proposals

The 2015 Kernel Summit will be held October 26-28 in Seoul, South Korea; the call for discussion proposals is out now. Now would be a good time for those who would like to attend the Summit to come up with a good topic and get the discussion going. Proposals are due by July 31.


Kernel development news

4.2 Merge window part 3

By Jonathan Corbet
July 7, 2015
By the time Linus released 4.2-rc1 and closed the merge window on July 5, 12,092 non-merge changesets had been pulled into the mainline kernel repository. That makes 4.2, by your editor's reckoning (but not Linus's — see below), the busiest merge window in the kernel project's history, beating the previous record holder, 3.15, by 58 commits. Even so, Linus doesn't believe that 4.2 will end up being the busiest development cycle for a simple reason: we have gotten better at fixing our code before it goes into the mainline, so fewer fixes are required thereafter. If one assumes that 3.15 had a higher fix rate than 4.2 will, then 4.2 should fall short of 3.15's total.

Such ideas are relatively easy to explore using the numbers, so here is the history from the last few years or so, showing non-merge changesets for each kernel release:

    Release    Merge window    Total    %fixes
    v3.0           7333         9153     19.9
    v3.1           7202         8693     17.2
    v3.2          10214        11881     14.0
    v3.3           8899        10550     15.6
    v3.4           9248        10899     15.1
    v3.5           9534        10957     13.0
    v3.6           8587        10247     16.2
    v3.7          10409        11990     13.2
    v3.8          10901        12394     12.0
    v3.9          10265        11910     13.8
    v3.10         11963        13637     12.3
    v3.11          9494        10893     12.8
    v3.12          9479        10927     13.3
    v3.13         10518        12127     13.3
    v3.14         10622        12311     13.7
    v3.15         12034        13722     12.3
    v3.16         11364        12804     11.2
    v3.17         10872        12354     12.0
    v3.18          9711        11379     14.7
    v3.19         11408        12617      9.6
    v4.0           8950        10346     13.5
    v4.1          10659        11916     10.5
    v4.2          12092           ??

Since the beginning of the 3.x series, the average kernel release has seen 13.6% of its changes pulled after the close of the merge window. In the time between the releases of 3.15 and 4.1, 71,416 changesets were merged, of which 8,452 — 11.8% — came outside of the merge window. So one might conclude that the amount of code arriving outside the merge window has fallen a bit in the last year. If the 11.8% rate holds this time around, 4.2 will finish with 13,709 changesets, 13 short of the total for 3.15.

So, it's possible that 3.15 will remain the busiest development cycle ever, but your editor must conclude that the jury is still out on this one.

In any case, the long-term trend is clear:

[Post-merge-window changes plot]

Over time, the kernel development community has indeed gotten better at merging code that does not require fixing later in the development cycle.

Final changes for 4.2

There were just over 1,200 non-merge changesets pulled into the mainline kernel repository since last week's summary. Among those were:

  • Large x86-based systems can now defer the initialization of much of main memory, speeding the boot process.

  • Some changes affecting how mounts of sysfs and /proc are managed have been merged. Subdirectories that are meant to serve as mount points (e.g. /sys/debug) are now marked as such, and mounts are limited to those directories. Beyond that, new rules have been added to ensure that new mounts of these filesystems (within a container, say) respect the mount flags used with existing mounts. The controversial enforcement of the noexec and nosuid flags has been removed for now, though.

  • Synopsys DesignWare ARC HS38 processors are now supported. Other new hardware support includes Dell airplane-mode switches, TI TLC59108 and TLC59116 LED controllers, Maxim max77693 LED flash controllers, Skyworks AAT1290 LED controllers, Broadcom BCM6328 and BCM6358 LED controllers, Kinetic Technologies KTD2692 LED flash controllers, TI CDCE925 programmable clock synthesizers, Hisilicon Hi6220 clocks, STMicroelectronics LPC watchdogs, Conexant Digicolor SoC watchdogs, Dialog DA9062 watchdogs, and Weida HiTech I2C touchscreen controllers.

  • The red-black tree implementation now supports "latched trees"; these maintain two copies of the tree structure in parallel and only modify one at a time. The end result is that non-atomic modifications can happen concurrently with lookups without creating confusion. See this commit for the implementation, and this one for some discussion of the latched technique. The first use of this technique is to accelerate module address lookups.

If recent patterns hold (and Linus doesn't take any more ill-timed vacations), the final 4.2 release can be expected on August 23.

A postscript

Some readers may be wondering why this article claims that 4.2 had the busiest merge window ever, given that Linus said otherwise in the 4.2-rc1 release announcement:

And it turns out v3.15-rc1 had more commits than 4.2-rc1 does (by a hair), so even there this isn't the biggest rc1 ever, if you count the number of commits.

The difference is that Linus is counting merge commits, while your editor does not. As mentioned above, there were 12,092 non-merge changesets pulled before 4.2-rc1, but that number grows to 12,809 changesets when merges are counted; that falls just short of the total (12,826) for 3.15-rc1. Your editor's reasoning for leaving out merges is that they mostly just represent the movement of patches from one branch to another and, thus, differ from "real" development work. No doubt others will have different opinions, though.


Restartable sequences

By Jonathan Corbet
July 7, 2015
Concurrent code running in user space is subject to almost all of the same constraints as code running in the kernel. One of those is that cross-CPU operations tend to ruin performance, meaning that data access should be done on a per-CPU basis whenever possible. Unlike kernel code, though, user-space per-CPU code cannot enter atomic context; it, thus, cannot protect itself from being preempted or moved to another CPU. The restartable sequences patch set recently posted by Paul Turner demonstrates one possible solution to that problem by providing a limited sort of atomic context for user-space code.

Imagine maintaining a per-CPU linked list, and needing to insert a new element at the head of that list. Code to do so might look something like this:

    new_item->next = list_head[cpu];
    list_head[cpu] = new_item;

Such code faces a couple of hazards in a multiprocessing environment. If it is preempted between the two statements above, another process might slip in and insert its own new element; when the original process resumes, it will overwrite list_head[cpu], causing the loss of the item added while it was preempted. If, instead, the process is moved to a different CPU, it could get confused between each CPU's list or run concurrently with a new process on the original CPU; the result in either case would be a corrupted list and late-night phone calls to the developer.

These situations are easily avoidable by using locks, but locks are expensive even in the absence of contention. The same holds for atomic operations like compare-and-swap; they work, but the result can be unacceptably slow. So developers have long looked for faster alternatives.

The key observation behind restartable sequences is that the above code shares a specific feature with many other high-performance critical sections, in that it can be divided into two parts: (1) an arbitrary amount of setup work that can be thrown away and redone if need be, and (2) a single instruction that "commits" the operation. The first line in that sequence:

    new_item->next = list_head[cpu];

has no visible effect outside the process it is executing in; if that process were preempted after that line, it could just execute it again and all would be well. The second line, though:

    list_head[cpu] = new_item;

has effects that are visible to any other process that uses the list head. If the executing process has been preempted or moved in the middle of the sequence, that last line must not be executed lest it corrupt the list. If, instead, the sequence has run uninterrupted, this assignment can be executed with no need for locks or atomic instructions. That, in turn, would make it fast.

A restartable sequence as implemented by Paul's patch is really just a small bit of code stored in a special region of memory; that code implements both the setup and commit stages as described above. If the kernel preempts a process (or moves it to another CPU) while the process is running in that special section, control will jump to a special restart handler. That handler does whatever is needed to restart the sequence; often (as it would be in the linked-list case) it's just a matter of going back to the beginning and starting over.

The sequence must adhere to some restrictions; in particular, the commit operation must be a single instruction and code within the special section cannot invoke any code outside of it. But, if it holds to the rules, a restartable sequence can function as a small critical section without the need for locks or atomic operations. In a sense, restartable sequences can be thought of as a sort of poor developer's transactional memory. If the operation is interrupted before it commits, the work done so far is simply tossed out and it all restarts from the beginning.

Paul's patch adds a new system call:

    int restartable_sequences(int op, int flags, long val1, long val2, long val3);

There are two operations that can be passed as the op parameter:

  • SYS_RSEQ_SET_CRITICAL sets the critical region; val1 and val2 are the bounds of that region, and val3 is a pointer to the restart handler (which must be outside of the region).

  • SYS_RSEQ_SET_CPU_POINTER specifies a location (in val1) of an integer variable to hold the current CPU number. This location should be in thread-local storage; it allows each thread to quickly determine which CPU it is running on at any time.

The CPU-number pointer is needed so that each section can quickly get to the correct per-CPU data; to emphasize that, the restart handler will not actually be called until this pointer has been set. Only one region for restartable sequences can be established (but it can contain multiple sequences if the restart handler is smart enough), and the region is shared across all threads in a process.

Paul notes that Google is using this code internally now; it was also discussed at the Linux Plumbers Conference [PDF] in 2013. He does not believe it is suitable for mainline inclusion in its current form, though. The single-region limitation does not play well with library code, the critical section must currently be written in assembly, and the interactions with thread-local storage are painful. But, he thinks, it is a reasonable starting place for a discussion on how a proper interface might be designed.

Paul's patch is not the only one in this area; Mathieu Desnoyers posted a patch set with similar goals back in May. Given Linus's reaction, it's safe to say that Mathieu's patch will not be merged anytime soon, but Mathieu did achieve his secondary goal of getting Paul to post his patches. In any case, there is clearly interest in mechanisms that can improve the performance of highly concurrent user-space code, so we will almost certainly see more patches along these lines in the future.


Deferred memory locking

By Jonathan Corbet
July 8, 2015
The mlock() and mlockall() system calls are charged with locking a portion (or all) of a process's address space into physical memory. The most common use cases for this functionality are situations where the latency of a page fault cannot be afforded and protecting sensitive data (cryptographic keys, say) from being written out to the swap device. Both system calls assume that the caller wants all of the memory present and locked immediately, but that may not always be the case. As a result, we are likely to see new versions of the memory-locking system calls in the near future.

The idea that a user who has requested the locking of a range of memory doesn't actually want it locked now may seem a little strange; that is what mlock() and mlockall() were created for, after all. The problem with immediate locking, as described by Eric Munson in his patch set, is that faulting in and locking a large address range can take a long time, and much of that time may be wasted if the calling process never actually uses much of that memory. If the cost of a page fault on the first access to a given page is not an issue, deferring the population and locking of a memory range can be a useful way to improve performance.

The cryptographic use case is one where deferred locking might make sense: the buffer to be locked may need to be able to handle a large worst case, but, most of the time, the portion of the buffer that's actually used is quite a bit smaller. If the pages that make up that buffer could only be locked after they are first faulted in, the objective of preventing writeout to the swap device will be met with lower overhead overall. Eric also mentions programs that use small parts of a large buffer, but which cannot know from the outset which parts will be used.

The solution in both cases is to modify mlock() so that it does not fault in all of the pages in the indicated address range. Instead, the range is simply marked as "lock on fault." Whenever a page within that range is faulted in, it will be locked from then on.

The problem is that mlock() has this prototype:

    int mlock(const void *addr, size_t len);

There is no way to tell the kernel to not fault the pages in immediately. The natural response is to create a new system call that has a feature that arguably should have been present in mlock() in the first place: a "flags" argument:

    int mlock2(const void *addr, size_t len, int flags);

The flags argument has two possibilities: MLOCK_LOCKED (to fault in the pages immediately) or MLOCK_ONFAULT (which only locks pages once they are faulted in). Exactly one of those flags must be present in any mlock2() call.

The mlockall() system call does already have a flags argument; the new MCL_ONFAULT flag has been added to request the new behavior via that interface. There is also a new flag (MAP_LOCKONFAULT) that can be used to get locked-on-fault behavior when creating an address range with mmap().

Eric's patch set adds new versions of the corresponding unlock system calls:

    int munlock2(const void *addr, size_t len, int flags);
    int munlockall2(int flags);

These system calls have the effect of clearing the given flags; the actual unlocking of memory is a side effect if all the flags are cleared. If a region has been locked with MLOCK_ONFAULT, one can call:

    munlock2(addr, len, MLOCK_ONFAULT);

to cancel the on-fault locking in the future while leaving currently locked pages in place, or:

    munlock2(addr, len, MLOCK_LOCKED|MLOCK_ONFAULT);

to unlock the address range entirely. It is not entirely clear (to your editor, at least) what will happen if munlock2() is called with just the MLOCK_LOCKED flag in this situation. Similar things can be done with munlockall2(); in this case, it is also possible to clear existing flags like MCL_FUTURE.

This patch set has been through a few iterations over the last few months. It has taken Eric a bit of work to convince reviewers of the value of this functionality; review comments also led to the addition of the new system calls (as opposed to just the new mmap() and mlockall() flags). This patch set has found its way into the -mm tree, which is a good sign that it's likely to head toward the mainline sometime in the relatively near future.


Patches and updates

Kernel trees

Linus Torvalds: Linux 4.2-rc1
Steven Rostedt: 3.18.17-rt14
Greg KH: Linux 3.14.47
Steven Rostedt: 3.14.46-rt46
Greg KH: Linux 3.10.83
Steven Rostedt: 3.10.82-rt89


Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds