Kernel development [LWN.net]

Kernel release status

The current development kernel is 4.8-rc1, released on August 7. "This seems to be building up to be one of the bigger releases lately, but let's see how it all ends up. The merge window has been fairly normal, although the patch itself looks somewhat unusual: over 20% of the patch is documentation updates, due to conversion of the drm and media documentation from docbook to the Sphinx doc format."

Stable updates: the 4.6.6, 4.4.17, and 3.14.75 updates were released on August 10.


Quotes of the week

Deferred probe is probably the best thing that ever happened for the quality of kernel error handling.

— Mark Brown

/me hands Andy a time machine to go fix this properly, before so much silicon ships.

— Borislav Petkov should hand out a few more of those.


The end of the 4.8 merge window

By Jonathan Corbet
August 10, 2016

By the time Linus released 4.8-rc1 and closed the merge window for this development cycle, 11,618 non-merge changesets had found their way into the mainline repository. That suggests that 4.8 will be a relatively busy development cycle, but not busy enough to break any records. Just over 1,000 of those changesets were pulled after last week's summary was written; some of the more interesting changes in that last set include:

The Ceph filesystem now has full RADOS namespace support. This feature has been partially supported since 4.5; the final pieces were merged for 4.8.
The OrangeFS filesystem has better in-kernel caching support; see the pull-request text for more information.
The new printk.devkmsg command-line parameter can be used to control the ability of user space to send data to the kernel log via /dev/kmsg. The default setting of ratelimit applies rate limiting to data from user space. Other possibilities are on (allowing unlimited logging, as older kernels did) and off to disable logging from user space entirely.
M68k binaries built for systems without a memory-management unit can now be run on ordinary, MMU-equipped systems as well. That will help developers of such applications debug them on more powerful systems.
The new "software RDMA over Ethernet" driver allows the use of InfiniBand remote DMA protocols over the kernel's network stack.
Reverse-mapping support has been added to the XFS filesystem; this feature allows the filesystem code to track the ownership of every block on a storage device. Reverse mapping in its current form is not hugely useful, but it will be a core part of a set of intended XFS features for future development cycles; these features include reflink(), copy-on-write data, data deduplication, much-improved bad block reporting, and better recovery from filesystem damage. As Dave Chinner put it: "There's a lot of new stuff coming along in the next couple of cycles, and it all builds in the rmap infrastructure."
The architecture emulation containers feature has been merged; it allows containers to run code built for an architecture that differs from that of the host system.
The post-init read-only memory kernel-hardening feature now works with data in loadable modules as well.
The hardened usercopy patches were merged after the 4.8-rc1 release. This feature adds more checking to the kernel functions that copy data between kernel and user space with the idea of making them harder to exploit.
New hardware support includes: RapidIO channelized mailbox controllers, IDT RXS Gen.3 SRIO switches, IBM POWER virtual SCSI target servers, Maxim MAX6916 SPI realtime clocks, Silead I2C touchscreens, SiS 9200 family I2C touchscreens, Broadcom iProc PWM controllers, STMPE expander PWM controllers, ChromeOS EC PWM controllers, and J-Core J2 processors.

One thing that did not make it this time around, despite being pushed during the merge window, is the "latent entropy" GCC plugin. This program instruments various kernel functions in an attempt to generate some entropy from randomness in how the hardware responds, especially during that period early in the boot process when entropy may be in short supply. Linus was unimpressed by the pull request and unconvinced by the techniques used in the plugin itself. He has indicated that he might eventually take the plugin, but not right away, so this one looks like it will wait until the 4.9 development cycle.

If the usual schedule holds, the final 4.8 release will come out on September 25, which will place the 4.9 merge window during the Kernel Recipes and LinuxCon Europe conferences. That will thus be a busy time, but, between now and then, the work of testing this kernel and fixing the bugs needs to be done.
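The September 25 date follows from simple calendar arithmetic. The sketch below assumes the common pattern of weekly release candidates from -rc1 through -rc7, with the final release a week after -rc7, and a two-week merge window after that. Both intervals are assumptions about the usual cadence rather than promises; the function names are made up for illustration.

```python
from datetime import date, timedelta

RC1 = date(2016, 8, 7)   # 4.8-rc1 release date, from the text above
WEEKS_AFTER_RC1 = 7      # assumption: -rc2 to -rc7 weekly, final one week after -rc7
MERGE_WINDOW_WEEKS = 2   # assumption: the usual two-week merge window


def projected_final(rc1, weeks=WEEKS_AFTER_RC1):
    """Project the final release date from the -rc1 date."""
    return rc1 + timedelta(weeks=weeks)


def merge_window(final, weeks=MERGE_WINDOW_WEEKS):
    """Return (open, close) dates of the next merge window."""
    return final, final + timedelta(weeks=weeks)


final = projected_final(RC1)
window_open, window_close = merge_window(final)
print("projected 4.8 final:", final.isoformat())
print("projected 4.9 merge window:", window_open.isoformat(), "to", window_close.isoformat())
```

An extra release candidate, which does happen in some cycles, would push both dates back by a week.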


Four new Android privilege escalations

By Jake Edge
August 10, 2016

The "QuadRooter" vulnerabilities are currently making lots of headlines, at least partly because they could impact up to 900 million Android devices. There are four separate bugs, each with its own CVE number. Interestingly, all are found in code that lives outside of the mainline kernel—but is obviously shipped in a lot of devices.

QuadRooter, which was announced with great fanfare by Check Point Software Technologies, consists of privilege-escalation vulnerabilities that could be used by malicious apps to take control of an Android device and, of course, the personal data stored on it. The four bugs were found in drivers for Qualcomm systems-on-chip (SoCs) that are used in many Android phone models, including the flagship Google Nexus 5X, 6, and 6P handsets. The bugs are serious, but users can mitigate the risk somewhat by avoiding dubious apps.

The bugs are detailed in a report [registration required] from Check Point. Note that unchecking the "please send me email" box on the registration form does not actually seem to stop Check Point from sending emails. The vulnerabilities are found in three different subsystems of the Qualcomm kernel: one in the ipc_router interprocess communication (IPC) module, one in the ashmem shared-memory allocation subsystem, and two in the kernel graphics support layer (kgsl) that is used to render graphics provided by user-space programs. None of those modules is in the mainline kernel proper; ashmem is in the staging tree, but that version does not contain the function that caused the vulnerability.

For the most part, the bugs themselves are fairly standard kinds of flaws. The two in kgsl are use-after-free vulnerabilities, the ashmem bug provides a way to get attacker-controlled data into the kernel, and the ipc_router bug is a memory corruption that can lead to code execution. It is noteworthy that, because the code is out of the mainline, it probably didn't get the attention, testing, fuzzing, and review that it might otherwise have received (from the kernel development community, anyway). Given its prevalence in Android devices, though, it did garner some amount of attention, from Check Point, at least, and perhaps from others who are far less likely to report on what they found.

A look inside the flaws is instructive. CVE-2016-2059 is the ipc_router code-execution bug. The module provides a new address family (AF_MSM_IPC) that can be used to create sockets. Users can convert "client" sockets to "control" sockets by way of an ioctl() call. Unfortunately, the conversion function locks the wrong list, which allows (malicious) callers to corrupt a different list. Elements on that list can be made to point to freed memory, which the attacker can control using "heap spraying".

The report goes into some detail on how that corruption can be used to call arbitrary kernel functions with attacker-controlled parameters, which makes for interesting reading. But the upshot is clear: root privileges can be gained and SELinux disabled, which gives the attacker complete control over the device and its contents.

The first of the kgsl bugs (CVE-2016-2503) is caused by a race condition in the function used to destroy a "syncsource" object in the kgsl_sync subsystem, which synchronizes graphics data between user space and the kernel. If two or more threads call the function with the same syncsource, the reference count can be decremented incorrectly, leading to a negative count. That will allow attackers to control the memory contents of the object that the kernel still thinks is in use, which can then be used to execute code of the attacker's choosing. The recent reference count hardening work might help avoid reference-count underflows like this.
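The underflow can be modeled without any real threads. The sketch below is not the kgsl code, which the report does not reproduce here; the class and function names are hypothetical. It plays back the racy interleaving deterministically: two callers both find the same live object and each drops the one reference that only a single caller should drop. A checked counter, in the spirit of reference-count hardening, turns the silent underflow into a loud failure.

```python
class RefCount:
    """Naive counter: nothing stops it from going below zero."""

    def __init__(self, value=1):
        self.value = value

    def put(self):
        self.value -= 1
        return self.value


class CheckedRefCount(RefCount):
    """Hypothetical hardened variant: refuses to drop below zero."""

    def put(self):
        if self.value <= 0:
            raise RuntimeError("refcount underflow")
        return super().put()


def racy_destroy(refcount, callers=2):
    """Replay the bad interleaving: every caller has already looked up the
    same object, and each one now drops the same single reference."""
    for _ in range(callers):
        refcount.put()
    return refcount.value


print("naive count after two destroys:", racy_destroy(RefCount()))
try:
    racy_destroy(CheckedRefCount())
except RuntimeError as err:
    print("checked counter caught:", err)
```

The checked counter does not fix the race itself; it only keeps the count from going negative and makes the bug visible at the point where it happens.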

The second kgsl use-after-free vulnerability (CVE-2016-2504) is even easier to trigger. There is an ioctl() that allows users (or attackers) to directly free a specific kgsl_mem_entry object by its ID number, without any access control. That means one thread can free the object while another kernel path still holds a reference to it. The usual use-after-free games can be played at that point.
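A toy model of that pattern might look like the following. All of the names are invented for illustration and the `freed` flag merely stands in for memory being returned to the allocator; the point is only that the lookup hands out a pointer without taking a reference, so a free-by-ID request can pull the object out from under its user.

```python
class MemEntry:
    """Stand-in for a kernel object that is looked up by ID."""

    def __init__(self, entry_id):
        self.id = entry_id
        self.freed = False


class EntryTable:
    def __init__(self):
        self.entries = {}

    def alloc(self, entry_id):
        self.entries[entry_id] = MemEntry(entry_id)

    def find(self, entry_id):
        # Hands back a borrowed pointer; no reference is taken.
        return self.entries.get(entry_id)

    def free_by_id(self, entry_id):
        # Models an unchecked "free this ID" request from user space.
        entry = self.entries.pop(entry_id, None)
        if entry is None:
            return False
        entry.freed = True  # stands in for the memory being released
        return True


def use_after_free_demo():
    table = EntryTable()
    table.alloc(7)
    in_use = table.find(7)   # thread A: a kernel path starts using entry 7
    table.free_by_id(7)      # thread B: frees entry 7 by ID
    return in_use.freed      # thread A is now holding freed memory


print("entry freed while still in use:", use_after_free_demo())
```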

The bug (CVE-2016-5340) in ashmem, a memory allocator that allows processes to easily share memory, is a bit different. The Qualcomm version of ashmem has diverged from the one in staging, with some new functions provided to access the struct file from a file descriptor as long as the file is an ashmem shared-memory file. But the is_ashmem_file() function simply tests whether the file name is /ashmem, which is the file name used by the subsystem. However, a deprecated (and perhaps obscure) feature of Android, which exists to support large files that accompany an app's .apk file, also allows apps to mount a filesystem with an ashmem entry in its root:

Attackers can use a deprecated feature of Android, called Obb to create a file named ashmem on top of a file system. With this feature, an attacker can mount their own file system, creating a file in their root directory called "ashmem."

By sending the fd of this file to the get_ashmem_file function, an attacker can trick the system to think that the file they created is actually an ashmem file, while in reality, it can be any file.

Thus a malicious app could fool the ashmem subsystem into using attacker-controlled data in what it thinks is a file with contents that are normally completely under its control.
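The weakness of a name-based test is easy to show in miniature. In the sketch below, `File` and `FileOps` are simplified stand-ins for the kernel structures, and the identity check against the subsystem's own operations table is an assumption about how such a test could be made robust (it is a common kernel idiom), not a description of the fix Qualcomm shipped.

```python
class FileOps:
    """Stand-in for a table of file operations owned by one subsystem."""

    def __init__(self, owner):
        self.owner = owner


ASHMEM_FOPS = FileOps("ashmem")        # the one table real ashmem files use
OBB_FS_FOPS = FileOps("attacker-fs")   # hypothetical attacker-mounted filesystem


class File:
    def __init__(self, name, fops):
        self.name = name
        self.fops = fops


def is_ashmem_file_by_name(f):
    # The flawed check described above: trust the path alone.
    return f.name == "/ashmem"


def is_ashmem_file_by_ops(f):
    # Assumed alternative: check who actually implements the file.
    return f.fops is ASHMEM_FOPS


genuine = File("/ashmem", ASHMEM_FOPS)
spoofed = File("/ashmem", OBB_FS_FOPS)

for label, f in (("genuine", genuine), ("spoofed", spoofed)):
    print(label, is_ashmem_file_by_name(f), is_ashmem_file_by_ops(f))
```

A name is something an attacker can choose; the operations table is not, which is why the second test rejects the spoofed file.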

Check Point has created a QuadRooter Scanner app that is available in the Google Play store. It scans an Android device and reports which, if any, of the vulnerabilities affect it. There is some skepticism about how good a job it actually does, however. On my Nexus 6P, which has the July 5 update installed, the scanner reports that the phone is vulnerable to CVE-2016-2504 and CVE-2016-5340; neither was reported as fixed in the July Android Security Bulletin.

That would seem to indicate that a recently purchased flagship phone is still vulnerable to two of the bugs. The August bulletin does mention a fix for CVE-2016-2504, but it makes no mention of CVE-2016-5340. The August update has not been made available over Google's Project Fi carrier as of yet, however. According to the report, Qualcomm was informed about the bugs in April and has confirmed that it released updated code to OEMs.

But, as we have seen rather often in the Android world, those fixes are taking some time to make their way out to users. Even users of Google's phones and network are awaiting some fixes. Other carriers and device makers tend to lag even further behind—or fail to ever get updates out at all. That leaves lots of phone owners in a tricky spot.

Users who are not running random side-loaded apps are likely to be less vulnerable to problems from QuadRooter, though. That is not to say it is impossible for a malicious app to slip into the Google Play store, but it is definitely less probable. The source of these kinds of malicious apps will be some dodgy app store that promises to deliver the latest exciting game or other app. Users of vulnerable phones should steer clear of such sites and generally try to be alert to odd behavior. That's good advice even well after QuadRooter is fixed on phones, as there are undoubtedly other, similar bugs lurking out there, both in the mainline and various vendor kernels.


The NET policy mechanism

By Jonathan Corbet
August 10, 2016

One of the heuristics that guide kernel development says that, whenever possible, the addition of tuning knobs should be resisted. Such knobs are seen as the developer giving up and pushing a tuning problem onto users; instead, the kernel should, whenever possible, tune itself to suit the current workload. An attempt to reduce the user's tuning responsibilities for the networking subsystem is running into resistance, though.

Arguably, no part of the kernel offers more opportunities for user tuning than networking. Queuing disciplines and traffic control allow the creation of elaborate, in-kernel routing for packets. Interrupt affinities and device polling can be tweaked; there are numerous congestion-control algorithms to choose between; queue lengths and packet-ring sizes can be played with; and so on. There is also a whole set of policies and knobs that can be set within the network interfaces themselves. The result is a subsystem with a great deal of flexibility, but also one that is complex and difficult for most people to tune properly. Thus, many administrators do not even try if they can avoid it. Unfortunately, they often cannot avoid it; as Kan Liang noted in the introduction to his kernel NET policy patch set, "network performance is not good with default system settings."

That patch set introduces a new high-level policy mechanism; the administrator can use it to describe the sort of workload that the networking subsystem should be tuned for. The options are:

CPU: the most important factor is reducing the amount of CPU time needed to keep up with the network.
Latency: the latency of network communications should be kept to a minimum.
Throughput: the goal is to push the maximum amount of data through the network.

These policies may be set at a per-interface level, in which case they apply to all communications flowing through the affected interface. Policies can also be set on a per-task and per-socket level, though, allowing different users to operate under different policies. In this case, the interface-level policy must be set to the special "mixed" option; if the interface is given any other policy, all communications through that interface must match that policy.
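One way to read those rules is as a small resolution function. The sketch below is an interpretation, not code from the patch set: the names are invented, the choice to let a per-socket policy override a per-task one is an assumption (the text does not say which wins), and a non-mixed interface policy is modeled as simply overriding anything more specific.

```python
from enum import Enum


class Policy(Enum):
    CPU = "cpu"
    LATENCY = "latency"
    THROUGHPUT = "throughput"
    MIXED = "mixed"   # interface-level only: defer to task/socket policies


def effective_policy(interface, task=None, socket=None):
    """Return the policy that governs one flow, or None if nothing is set."""
    if interface is not Policy.MIXED:
        # Any concrete interface policy applies to all traffic through it.
        return interface
    if socket is not None:   # assumption: socket is more specific than task
        return socket
    return task


print(effective_policy(Policy.THROUGHPUT, task=Policy.LATENCY))
print(effective_policy(Policy.MIXED, task=Policy.CPU, socket=Policy.LATENCY))
print(effective_policy(Policy.MIXED))
```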

Exactly how these policies are implemented is not well documented in the patch set; that is not helped by the fact that, in the current version, there are no driver-level patches implementing the new policy-setting hooks. That support can be seen in a previous version of the patch set; it was seemingly removed in response to complaints about the length of the series as a whole. Therein, one sees that much of the functionality is dependent on Intel's "Ethernet Flow Director" technology, though Liang maintains that it can be made to work on any adapter that supports loadable flow-direction rules — as many high-end adapters do.

One aspect of the policy implementation is interrupt mitigation. Most high-speed network adapters can handle vast numbers (as in millions) of packets per second; if they generated interrupts for every packet sent or received, the system would be swamped. So these adapters support various mechanisms for reducing the number of interrupts delivered. This is where the policy comes in: reducing the number of interrupts raised by the interface can increase the amount of time it takes to process a packet, thus increasing latency. So a latency-sensitive policy will tolerate more interrupts, while a CPU-conserving policy will reduce interrupts to a minimum.
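The tradeoff can be put in rough numbers. The model below is deliberately simplistic and is an assumption made for illustration: it pretends that the adapter raises one interrupt for every N packets at a steady arrival rate, whereas real hardware typically also uses timers and adaptive schemes. The packet rate and batch sizes are illustrative values, not measurements.

```python
def interrupts_per_second(packets_per_second, packets_per_interrupt):
    """Interrupt rate when one interrupt covers a fixed batch of packets."""
    return packets_per_second / packets_per_interrupt


def worst_case_added_delay_us(packets_per_second, packets_per_interrupt):
    """Extra wait, in microseconds, for the first packet of a batch while
    the rest of the batch arrives at a steady rate."""
    return (packets_per_interrupt - 1) * 1_000_000 / packets_per_second


RATE = 1_000_000  # packets per second (illustrative)
for batch in (1, 8, 64):
    print(
        f"batch={batch:3d}  "
        f"interrupts/s={interrupts_per_second(RATE, batch):>10.0f}  "
        f"added delay={worst_case_added_delay_us(RATE, batch):5.1f} us"
    )
```

Under this model a batch of one gives the lowest latency at the cost of a million interrupts per second, while larger batches trade a few tens of microseconds of delay for a far lower interrupt load.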

Multi-queue devices (the only type supported by this patch set) can steer packets to specific queues and vary their interrupt behavior for each. Multiple queues can be used to support policy goals in other ways as well; throughput-oriented queues can be longer and run at lower priority, while latency-oriented queues should be high-priority and short. So the other aspect of the NET policy patches is queue-selection logic that depends on the policy attached to each packet. When a policy is established, the queues (and their CPU/interrupt affinities) are set up automatically, so the administrator need not deal with that sort of complexity.
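Policy-driven queue selection can be sketched as a lookup followed by a spread across matching queues. Nothing here is taken from the patch set: the `Queue` fields, the example queue layout, and the use of a flow hash to pick among queues with the same policy are all assumptions made to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Queue:
    index: int
    policy: str     # which policy this queue was set up to serve
    length: int     # ring size: short for latency, long for throughput
    priority: int   # lower number means higher priority


def select_queue(queues, policy, flow_hash):
    """Pick a queue matching the packet's policy, spreading flows by hash."""
    candidates = [q for q in queues if q.policy == policy]
    if not candidates:
        raise ValueError(f"no queue configured for policy {policy!r}")
    return candidates[flow_hash % len(candidates)]


# Hypothetical layout: two short high-priority queues, one long low-priority one.
QUEUES = [
    Queue(index=0, policy="latency", length=64, priority=0),
    Queue(index=1, policy="latency", length=64, priority=0),
    Queue(index=2, policy="throughput", length=1024, priority=1),
]

print(select_queue(QUEUES, "latency", flow_hash=5))
print(select_queue(QUEUES, "throughput", flow_hash=5))
```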

It will surprise few readers to learn that a number of networking developers expressed concerns about this patch set. Policy implementation in the kernel is generally something that developers try to avoid; the kernel is meant to implement mechanism, leaving policy decisions to others. Given that most of what the NET policy patches do can already be done from user space, some questioned why the remaining bits weren't added to the API so that policy selection could be done outside of the kernel.

The answer to this question, as found in the cover letter to the series, goes something like this. User space does not have access to the same level of information that the kernel has, and the information that is available can be stale and subject to race conditions. If you do push these decisions out to user space, you'll add more context switches and slow down the system as a whole. And only the kernel can manage competing requests from multiple users in a way that's fair to all. The networking developers understand these arguments, but not everybody seems convinced that solving the problem in user space is impossible.

Also, perhaps inevitably, it was suggested that, rather than coding queue selection into the policy code, that decision could be made by an eBPF program loaded from user space. Using eBPF would certainly add flexibility to the system, but it seems unlikely to make the task of policy administration easier.

As things stand now, it seems clear that quite a bit more effort will be required to convince the network development community that the NET policy patches are the best solution to the problem. But the problem itself is real; as Stephen Hemminger put it, "network tuning is hard, most people get it wrong, and nobody agrees on the right answer." Creating a set of canned policies in the kernel may not be the best solution to the problem, but the real proof of that would be to come up with a better solution, and those seem to be in short supply at the moment.


Patches and updates

Linus Torvalds Linux 4.8-rc1 Aug 07

Greg KH Linux 4.6.6 Aug 10

Sebastian Andrzej Siewior 4.6.5-rt10 ?

Greg KH Linux 4.4.17 Aug 10

Greg KH Linux 3.14.75 Aug 10

Josh Poimboeuf x86/dumpstack: rewrite x86 stack dump code ?

Lina Iyer PM: SoC idle support using PM domains Aug 04

Dave Hansen [v6] System Calls for Memory Protection Keys Aug 08

Petr Mladek kthread: Kthread worker API improvements Aug 09

Peter Zijlstra [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact Aug 09

Chris Metcalf support "task_isolation" mode Aug 09

Yuyang Du Optimize sched avgs computation and implement flat util hierarchy Aug 10

Waiman Long locking/mutex: Enable optimistic spinning of lock waiter Aug 10

Steven Rostedt tracing: Add Hardware Latency detector tracer ?

Anurup M arm64:perf: Support for Hisilicon SoC Hardware event counters ?

Rich Felker J-Core interrupt controller support ?

Rich Felker J-Core SPI controller support ?

Rich Felker J-Core timer support ?

YT Shen MT2701 DRM support ?

HS Liao Mediatek MT8173 CMDQ support Aug 08

Tiffany Lin Add MT8173 Video Decoder Driver Aug 10

Chunfeng Yun Add MediaTek USB3 DRD Driver Aug 09

Noralf Trønnes drm: add SimpleDRM driver ?

Peter Senna Tschudin Add driver for GE B850v3 LVDS/DP++ Bridge ?

Timur Tabi [v7] net: emac: emac gigabit ethernet controller driver Aug 03

Keerthy mfd: lp873x: Add lp873x PMIC support Aug 08

Lucile Quirion Technologic I2C-FPGA gpio support Aug 08

Mirza Krak Add support for Tegra GMI bus controller Aug 06

Akinobu Mita iio: adc: add ADC12130/ADC12132/ADC12138 ADC driver Aug 08

Chunyan Zhang Integration of function trace with System Trace IP blocks Aug 09

Andre Przywara Allwinner MMC firmware clocks implementation Aug 09

Neil Armstrong Add Platform MHU mailbox driver for Amlogic GXBB Aug 09

Andrew F. Davis Add support for the TI SM-USB-DIG Aug 09

Chris Zhong Rockchip Type-C and DisplayPort driver Aug 09

Wadim Egorov Add support for rk818 Aug 10

Anup Patel Cache-coherent DMA access using UIO Aug 08

Peter Chen power: add power sequence library Aug 08

Jonathan Corbet [RFC] Sphinxify and coalesce development-tool documents Aug 08

Paolo Valente Replace the CFQ I/O Scheduler with BFQ Aug 08

Mike Rapoport userfaultfd: add support for shared memory ?

js1304@gmail.com Introduce ZONE_CMA Aug 09

Huang, Ying THP swap: Delay splitting THP during swapping out Aug 09

Vlastimil Babka make direct compaction more deterministic Aug 10

kan.liang@intel.com Kernel NET policy ?

Ursula Braun net/smc: Shared Memory Communications - RDMA Aug 09

Sargun Dhillon RFC: Add Checmate, BPF-driven minor LSM ?

Liang Li Extend virtio-balloon for fast (de)inflating & fast live migration Aug 08

Steve Dickson ANNOUNCE: nfs-utils-1.3.4 released. Aug 06

Stephen Hemminger iproute2 4.7.0 Aug 08
