Kernel development

Brief items

Kernel release status

The 4.8 merge window is still open; see the separate article below for a list of what has been merged in the last week.

Stable updates: none have been released in the last week.

I've been saying this from the start: we can't make use of all the capabilities of pmem [persistent memory] with existing filesystems and DAX. DAX is supposed to be a *stopgap measure* until pmem native solutions are built and mature. Finding limitations like the above only serve to highlight the fact DAX on ext4/XFS is only a partial solution.

The real problem is, as always, a lack of resources to implement everything we want to be able to do. Building a new filesystem is hard, takes a long time, and all the people we have that might be able to do it are fully occupied by maintaining and enhancing the existing Linux filesystems to support things like DAX or other functionality that users want (e.g. rmap, reflink, copy offload, etc).

— Dave Chinner

ABIs increase the utility of the kernel.

— Ingo Molnar

File permissions in the kernel

There are many ways to make a poor impression in the kernel community; Baole Ni surely stumbled across one of them: post 1,285 separate cleanup patches, each with the same subject line, and each copied to a long list of developers. It was, David Miller said, "one of the worst patch series submissions in history." In theory, the objective of the patch was reasonable: replace hard-coded constants with their symbolic equivalents. But, it seems, this is a case where the community would rather see the numbers directly.

The change in question relates to places in the kernel where file permissions are specified — usually permissions for files to be created in sysfs or /proc. There is a set of macros defined in <linux/stat.h> for these permissions bits, but it is common practice in the kernel — and among users of Unix-like systems in general — to just use their octal equivalent instead. Thus, for example, one will often see 0444 instead of S_IRUGO. Indeed, it seems one will see it at least 1,285 times, given the length of the patch set sent to eliminate octal permissions from the kernel.

There were obviously lots of complaints about how the patch set was done, but there was also a lot of opposition to the change itself. It seems that many people find a constant like 0644 easier to read than S_IWUSR|S_IRUGO. In the end, Linus made that preference official, saying that he did not want to see any of the cleanup patches merged and that, in fact, it would be better to convert users of the macros to octal constants instead.
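
The equivalence between the two spellings is easy to check in user space. What follows is a minimal sketch, not kernel code: S_IRUSR and friends come from the POSIX <sys/stat.h> header, while S_IRUGO is a kernel-internal convenience macro that user-space headers do not provide, so it is defined locally here. The helper function names are hypothetical and exist only for illustration.

```c
#include <sys/stat.h>

/* Kernel-internal convenience macro, reproduced here because
 * user-space headers do not provide it. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* Hypothetical helpers: the same permissions spelled both ways. */
static unsigned int perm_symbolic_ro(void) { return S_IRUGO; }           /* r--r--r-- */
static unsigned int perm_symbolic_rw(void) { return S_IWUSR | S_IRUGO; } /* rw-r--r-- */
static unsigned int perm_octal_ro(void)    { return 0444; }
static unsigned int perm_octal_rw(void)    { return 0644; }
```

On a Linux system each symbolic helper returns the same value as its octal counterpart; which form is easier to read is the question the discussion turned on.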

Octal constants are not perfect either; as Al Viro pointed out, they are subject to subtle and hard-to-see errors. Perhaps, it was suggested, the real problem is that the (POSIX-defined) S_* macros are hard to read, obscuring the developer's intent rather than clarifying it. As an alternative, Ingo Molnar has proposed the adoption of a new set of macros, defined like this:

    #define PERM_rw_______	0600
    #define PERM_rw_r_____	0640
    #define PERM_rw_r__r__	0644
    #define PERM_rw_rw_r__	0664
    #define PERM_rw_rw_rw_	0666

All of the "useful" combinations have macros defined, while there are none for settings that don't make sense. Use of these macros, he said, would make the code clearer and make it harder to introduce security problems. Actually getting them merged, though, might require overcoming the habits of developers who have been typing octal constants for decades. The eventual discussion could yet end up being longer than the patch series that provoked it.
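
To make the readability argument concrete, here is a small user-space sketch. The PERM_rw_r__r__ definition is copied from the proposal above; format_perm() is a hypothetical helper written for this illustration. The dropped leading zero shown in the note after the code is just one easily imagined octal slip, and not necessarily the kind of error raised in the discussion.

```c
#define PERM_rw_r__r__  0644   /* from the proposed macro set above */

/* Hypothetical helper: render the low nine permission bits in
 * "rwxrwxrwx" form; buf must hold at least 10 bytes. */
static void format_perm(unsigned int mode, char buf[10])
{
    static const char flags[] = "rwxrwxrwx";

    for (int i = 0; i < 9; i++)
        buf[i] = (mode & (1u << (8 - i))) ? flags[i] : '-';
    buf[9] = '\0';
}
```

Passing PERM_rw_r__r__ produces "rw-r--r--", matching the macro's name. Passing 644 without the leading zero, which C reads as a decimal number, produces "-w----r--" in the low nine bits, a mistake that is hard to spot in review.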

Kernel development news

4.8 Merge window part 2

By Jonathan Corbet
August 3, 2016

As of this writing, Linus has pulled 10,589 non-merge changesets into the mainline repository; that is 7,433 since last week's summary. Clearly it has been a busy week. As is often the case, much of the work being merged takes the form of internal improvements that are not immediately visible to kernel users, but a number of interesting features have found their way in as well.

Some of the more significant user-visible features include:

The arm64 architecture has gained support for the kexec mechanism (allowing one kernel to boot directly into another) and kernel probes.
The TCP "New Vegas" congestion-control algorithm is now supported. New Vegas is a significant update to Vegas, adding better support for data-center settings in particular. See this document for details.
The mac80211 ("WiFi") layer has seen some interesting congestion-control changes. Normal queuing disciplines interact poorly with the frame aggregation mechanism used by wireless protocols, leading to poor performance, so the queuing discipline code has been disabled for mac80211. Instead, the mac80211 layer is now using the CoDel fair-queuing algorithm. This should be a significant step forward for better WiFi performance on Linux.
The reliable datagram sockets (RDS) protocol allows the creation of datagram-oriented connections over a TCP link. In 4.8, the RDS implementation can use multiple TCP connections to support RDS routing between two hosts, greatly increasing the maximum throughput. See this changelog for some details and a discussion of how this protocol differs from multipath TCP.
The "express data path" (XDP) work described in this article has moved forward. In 4.8, network drivers can define a hook allowing a BPF program to be loaded; that program will run on incoming packets before they even have internal data structures set up for them. The hook can indicate that packets should be dropped, but it also has the ability to do simple rewriting and forwarding. For some types of workloads, the result can be greatly increased performance without the need for kernel bypass techniques.
The kernel's pseudo-random number generator has been replaced with a new implementation using the ChaCha20 stream cipher. There have also been some changes made to address scalability problems when user-space programs are consuming massive amounts of random data.
The memory-management subsystem's page-reclaim mechanism has been fundamentally reworked to track pages based on NUMA nodes rather than on memory zones. As Mel Gorman noted in the patch posting, zone-based reclaim was important in the days of 32-bit systems with a lot of high memory but, now that large-memory systems are mostly running 64-bit kernels, node-based reclaim is a more suitable approach. Users should see little change beyond, hopefully, better performance; see the posting for a number of benchmark results.
A fair amount of work has been put in toward the goal of allowing unprivileged users to mount filesystems in user namespaces. That goal still depends on a number of remaining loose ends being addressed, though, and so will not be achieved in the 4.8 development cycle.
The kernel has gained support for the Common Architecture Label IPv6 Security Option (CALIPSO) standard. CALIPSO can be used to attach security labels to packets, making them subject to normal (SELinux or Smack) security policies.
The PowerPC64 architecture now has a just-in-time compiler for BPF programs.
New hardware support includes:
- Processors and systems: Artesyn MVME7100 single-board computers, R-Car V2H (R8A7792) systems-on-chip (SoCs), and Broadcom BCM23550 SoCs.
- Audio: Analog Devices ADAU7002 Stereo PDM-to-I2S/TDM converters, Cirrus Logic CS53L30 and CS35L33 codecs, Maxim MAX9860 mono audio voice codecs, Maxim MAX98504 speaker amplifiers, and Allwinner A10 I2S audio interfaces.
- Graphics: ARM Mali display processors, Silicon Image sii902x RGB/HDMI bridges, and Toshiba TC358767 eDP bridges.
- Input: Atmel capacitive touch buttons, Ntrig/Microsoft Surface 3 SPI touchscreens, Raydium I2C touchscreens, Pegasus Mobile Notetaker Pen input tablets, and Alps I2C HID touchpads and StickPointers.
- Miscellaneous: TI LP3952 2 channel LED controllers, Qualcomm Hexagon V5 peripheral image loaders, Marvell version 2 XOR engines, Xilinx ZynqMP DMA engines, R-Car R8A7796 clock pulse generators, Allwinner H3 clock-control units, AmLogic S905 clock controllers, PowerPC PowerNV PCI hotplug controllers, Aspeed 2400 watchdog timers, Maxim Max77620 watchdog timers, Amlogic Meson GXBB SoCs watchdog timers, Broadcom STB SDIO/SD/MMC host controllers, Broadcom PDC mailbox managers, Altera Arria10 DevKit system resource chips, Atmel external bus interface controllers, NVIDIA Tegra ACONNECT bus controllers, HiSilicon SPI-NOR flash controllers, MediaTek SDG1 NFC nand controllers, Atmel Quad SPI controllers, Cadence Quad SPI controllers, and Aardvark PCIe controllers.
- Networking: Freescale QUICC Engine HDLC controllers, Broadcom BCM53xx Ethernet switches, Broadcom Northstar2 PCIe PHYs, Intel XWAY PHYs, Renesas R-Car CAN FD controllers, Hisilicon fast Ethernet MAC controllers, and APM X-Gene SoC MDIO bus controllers.
- Pin control: Oxford Semiconductor OXNAS SoC family pin controllers, Maxim MAX77620/MAX20024 pin controllers, UniPhier PH1-LD11 and PH1-LD20 SoC pin controllers, Intel Merrifield pin controllers, Broadcom NSP pin controllers, Qualcomm 9615 pin controllers, and STMicroelectronics STM32F746 pin controllers.

Changes visible to kernel developers include:

The GCC plugin infrastructure patches have been merged, making it possible to use plugin modules to the compiler to modify how the kernel is built. As of this writing, plugins for coverage testing and calculation of cyclomatic complexity have been merged. The "latent entropy" plugin, which tries to generate entropy early in the bootstrap process, is in a pull request but has not been pulled as of this writing.
The new skb_array mechanism adds an array-based FIFO data structure for the queuing of network packets; see <linux/skb_array.h> for an overview of the API.
The task of reworking the CPU hotplug mechanism continues with the conversion of more notifiers to the new scheme. As Thomas Gleixner put it in the pull request: "Another 700 hundred line of unpenetrable maze gone".
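
The array-based FIFO idea behind skb_array, mentioned in the list above, can be illustrated with a simplified user-space sketch. This is not the kernel API (see <linux/skb_array.h> for that); the names ptr_fifo, fifo_produce(), and fifo_consume() are hypothetical, the capacity is arbitrary, and there is no locking, so the sketch is only suitable for a single thread.

```c
#include <stddef.h>

#define FIFO_SIZE 4   /* illustrative capacity */

/* A NULL slot means "empty", so no separate element count is needed. */
struct ptr_fifo {
    void *slot[FIFO_SIZE];
    size_t head;   /* next slot to fill */
    size_t tail;   /* next slot to drain */
};

static void fifo_init(struct ptr_fifo *f)
{
    for (size_t i = 0; i < FIFO_SIZE; i++)
        f->slot[i] = NULL;
    f->head = f->tail = 0;
}

/* Returns 0 on success, -1 if the FIFO is full or item is NULL. */
static int fifo_produce(struct ptr_fifo *f, void *item)
{
    if (item == NULL || f->slot[f->head] != NULL)
        return -1;
    f->slot[f->head] = item;
    f->head = (f->head + 1) % FIFO_SIZE;
    return 0;
}

/* Returns the oldest item, or NULL if the FIFO is empty. */
static void *fifo_consume(struct ptr_fifo *f)
{
    void *item = f->slot[f->tail];

    if (item != NULL) {
        f->slot[f->tail] = NULL;
        f->tail = (f->tail + 1) % FIFO_SIZE;
    }
    return item;
}
```

Storing the pointers in a fixed array keeps the queue compact in memory; a production implementation would also have to handle concurrent producers and consumers, which this sketch leaves out.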

The 4.8 merge window still has a few days to run, so expect a few more features to land before the 4.8-rc1 release comes out. Next week's Kernel Page will, of course, contain an update with the final changes to be merged for this development cycle.

Hardened usercopy

By Jake Edge
August 3, 2016

The kernel often copies data from and to user space, which makes copy_to_user() and copy_from_user() (and friends) rather frequently used kernel functions. But if the kernel can be tricked into copying too much data in either direction, security vulnerabilities can be the result. Long ago, grsecurity added the PAX_USERCOPY feature (created by the PaX team) to harden those calls, so that even poorly written code elsewhere in the kernel cannot truly copy more than it should. Code based on PAX_USERCOPY is now being proposed for inclusion into the mainline kernel.

Kees Cook posted the first version of his "hardened usercopy" patches in early July. The patches are based on some earlier work that Casey Schaufler had done to port the PAX_USERCOPY feature from grsecurity to the mainline. Essentially, the feature tries to ensure that address ranges used to copy data to and from user space are valid. Cook is also working on patches for two other parts of the PAX_USERCOPY feature; this piece is configured into the kernel with the CONFIG_HARDENED_USERCOPY option.

The main problems that can result from an errant user-space copy are either that too much data is copied to user space, resulting in leaking the contents of kernel memory, or that too much data is copied from user space, which can overwrite kernel memory. If an attacker can influence the allocation of objects on the kernel's heap and then overwrite some of those objects, they may be able to escalate privileges, run arbitrary code, or crash the kernel. Information leaks are generally less dangerous, but the kernel does have critical data (e.g. keys) that could be exposed. Beyond that, determining the layout of kernel memory by way of an information leak can also provide information needed to exploit other kernel flaws.

The patches add several tests of the arguments to the copy_*_user() functions, which have the following prototypes:

    long copy_from_user(void *to, const void __user *from, unsigned long n);
    long copy_to_user(void __user *to, const void *from, unsigned long n);

Each call involves a user-space pointer and a kernel-space pointer; the user-space pointers are already checked in current kernels, so the patches only add tests for the kernel-space pointers. Those tests ensure that the address range doesn't wrap past the end of memory, that the kernel-space pointer is not null, and that it does not point to a zero-length kmalloc() allocation (i.e. ZERO_OR_NULL_PTR() is false). Also, if the address range overlaps the kernel text (code) segment, it is rejected.
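
The basic checks described above can be sketched in user-space C. This is an illustration under stated assumptions rather than the code in the patch set: addresses are modeled as plain integers, struct text_range and basic_range_ok() are hypothetical names, and the SKETCH_ZERO_SIZE_PTR value of 16 is an assumed stand-in for the sentinel the kernel returns for zero-length kmalloc() allocations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for the kernel's zero-length-allocation sentinel. */
#define SKETCH_ZERO_SIZE_PTR ((uintptr_t)16)

/* Hypothetical stand-in for the kernel's text segment boundaries. */
struct text_range {
    uintptr_t start;   /* first byte of kernel text */
    uintptr_t end;     /* one past the last byte */
};

/* Returns true if the n bytes starting at ptr pass the basic checks. */
static bool basic_range_ok(uintptr_t ptr, size_t n,
                           const struct text_range *text)
{
    uintptr_t last;

    if (n == 0)
        return true;              /* nothing would be copied */
    if (ptr <= SKETCH_ZERO_SIZE_PTR)
        return false;             /* null or zero-length allocation */
    last = ptr + (n - 1);         /* address of the final byte */
    if (last < ptr)
        return false;             /* range wraps past the end of memory */
    if (ptr < text->end && text->start <= last)
        return false;             /* range overlaps kernel text */
    return true;
}
```

Working with the address of the last byte, rather than one past it, keeps the overlap test correct even for a range that ends at the very top of the address space.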

Beyond that, if the kernel-space address points into an object that has been allocated from the slab allocator, the patches ensure that what is being copied fits within the size of the object allocated. This check is performed by calling PageSlab() on the kernel address to see if it lies within a page that is handled by the slab allocator; if so, the code calls an allocator-specific routine to determine whether the amount of data to be copied is fully within an allocated object. If the address range is not handled by the slab allocator, the patches will test that it lies within either a single page or a compound page and that it does not span independently allocated pages.
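
The core of the slab check is a containment test: the copied range must lie entirely inside one allocated object. A minimal sketch of that test follows; struct heap_object and fits_in_object() are hypothetical names, and finding the object that an address belongs to (the allocator-specific part) is assumed to have happened already.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one allocated heap object. */
struct heap_object {
    uintptr_t base;   /* address of the first byte */
    size_t size;      /* usable size in bytes */
};

/* True if the n bytes starting at ptr lie entirely inside obj. */
static bool fits_in_object(const struct heap_object *obj,
                           uintptr_t ptr, size_t n)
{
    size_t offset;

    if (ptr < obj->base)
        return false;
    offset = ptr - obj->base;
    /* Written this way to avoid overflow in offset + n. */
    return offset <= obj->size && n <= obj->size - offset;
}
```

With a 64-byte object, a 32-byte copy starting at offset 32 passes, while a 33-byte copy from the same offset is rejected because it would run one byte into whatever follows the object.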

In addition, for copies involving the stack, the copied range must fit within the current process's stack. If there is architecture support for identifying stack frames, the copied range must fit within a single frame.
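
The stack test has three possible outcomes: the range is not on the stack at all (so other checks apply), it is fully on the stack, or it straddles a stack boundary and must be rejected. The following sketch shows that classification; the enum, the function name, and the use of integer addresses are illustrative assumptions, and the per-frame check available on some architectures is omitted.

```c
#include <stddef.h>
#include <stdint.h>

enum stack_check {
    NOT_ON_STACK = 0,     /* range does not touch the stack at all */
    FULLY_ON_STACK = 1,   /* range lies entirely within the stack */
    BAD_STACK_RANGE = -1  /* range straddles a stack boundary */
};

/* The stack occupies [stack_lo, stack_hi). */
static enum stack_check check_stack_range(uintptr_t ptr, size_t n,
                                          uintptr_t stack_lo,
                                          uintptr_t stack_hi)
{
    uintptr_t end = ptr + n;   /* assumes wraparound was rejected earlier */

    if (end <= stack_lo || ptr >= stack_hi)
        return NOT_ON_STACK;
    if (ptr >= stack_lo && end <= stack_hi)
        return FULLY_ON_STACK;
    return BAD_STACK_RANGE;
}
```

Only the last result indicates a problem; a range that is not on the stack is simply handed to the other tests.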

In all cases, an address range that fails the tests will generate a log message with the pertinent information. It will also call BUG() to generate a kernel oops and kill the current process (i.e. the one that was trying to exploit a kernel hole of some kind).

The patch set is broken up into three logical chunks: the main patch that adds the tests, patches that enable the feature for specific architectures (originally, x86, arm, arm64, ia64, powerpc, and sparc, with s390 added in a more recent patch set), and two patches that add heap-checking support for the SLAB and SLUB allocators. Cook noted that the SLOB allocator support in grsecurity "seems entirely broken", so he focused on testing SLAB and SLUB. In addition, stack frame checking has only been implemented for x86.

Cook said that he "couldn't detect a measurable performance change with these features enabled", when running tests like kernel builds and hackbench. That suggested that the feature could be turned on by default at some point, though it is turned off by default for now. Ingo Molnar suggested running a system-call-heavy workload to see if that showed any measurable performance degradation, as he would also like to see the feature on by default. Linus Torvalds said that a stat()-heavy workload (e.g. something like git diff) would be one way to test it, but indicated that he thought the checks would not be all that onerous.

Andy Lutomirski wondered if some of the infrastructure to validate the objects being copied should be given a different name, since it might be extended to more than just "usercopy" down the road. That set off a bit of a squabble between Molnar and PaX Team about the feature, threat models, and "bikeshedding". Cook, however, successfully tamped down the flickering flames:

There's a long history of misunderstanding and miscommunication (intentional or otherwise) by everyone on these topics. I'd love it if we can just side-step all of it, and try to stick as closely to the technical discussions as possible. Everyone involved in these discussions wants better security, even if we go about it in different ways. If anyone finds themselves feeling insulted, just try to let it go, and focus on the places where we can find productive common ground, remembering that any fighting just distracts from the more important issues at hand.

The patch set is in its fourth revision at this point; Cook has requested that it be pulled for 4.8. In the review process, some bugs have been fixed (notably some arm64 fixes and additions from Laura Abbott) and changes made, but no fundamental disagreement with the feature has emerged. As of this writing, the patches have not been pulled, but there were some prerequisites, so it may be that Torvalds simply hasn't gotten to it yet. But, if not for 4.8, it seems likely that we will see the feature appear in the mainline fairly soon.

Statistics from the 4.7 development cycle

By Jonathan Corbet
August 2, 2016

The 4.7 kernel was released on July 24, so longtime readers might be wondering where the usual development statistics are. We're running a little late this time around, but for good reason — Greg Kroah-Hartman obtained information from a large number of developers on who they work for, and we're now able to use that information to produce better numbers. Of course, the overall story hasn't changed a whole lot — kernel development is relatively boring and predictable these days — but each cycle still has a few noteworthy points.

The 4.7 development cycle saw the merging of 12,283 changesets from 1,582 developers; 232 of those developers appeared in the kernel changelog for the first time. Those changes added just under 300,000 lines to the kernel source and 740 new files to the kernel tree. Of those developers, the most active were:

Most active 4.7 developers

By changesets
H Hartley Sweeten	208	1.7%
Boris Brezillon	132	1.1%
Al Viro	127	1.0%
Linus Walleij	121	1.0%
Geert Uytterhoeven	120	1.0%
Arnaldo Carvalho de Melo	110	0.9%
Ville Syrjälä	105	0.9%
Laxman Dewangan	101	0.8%
Arnd Bergmann	97	0.8%
Jes Sorensen	97	0.8%
Eric Dumazet	91	0.7%
Dan Carpenter	88	0.7%
Aneesh Kumar K.V	79	0.6%
Michal Hocko	74	0.6%
Chris Wilson	71	0.6%
Wolfram Sang	68	0.6%
Florian Westphal	66	0.5%
James Hogan	66	0.5%
Daniel Vetter	64	0.5%
Imre Deak	62	0.5%

By changed lines
Alex Deucher	37185	6.4%
Rex Zhu	19912	3.4%
Paul E. McKenney	14004	2.4%
Thierry Reding	9170	1.6%
Jinshan Xiong	8828	1.5%
Yuval Mintz	8419	1.4%
Jes Sorensen	6982	1.2%
Chanwoo Choi	5742	1.0%
H Hartley Sweeten	5705	1.0%
Varun Prakash	5703	1.0%
Boris Brezillon	5347	0.9%
Aneesh Kumar K.V	5230	0.9%
Tom Zanussi	5116	0.9%
CK Hu	5072	0.9%
Ilya Dryomov	4764	0.8%
Linus Walleij	4738	0.8%
Maxime Ripard	4631	0.8%
Mathieu Poirier	4559	0.8%
Christoph Hellwig	4232	0.7%
Finn Thain	4024	0.7%

By this point it should come as no surprise that H Hartley Sweeten made it to the top of the "by changesets" list with continued work on the Comedi drivers in the staging tree; nearly 8,400 patches have gone into that subsystem since it was merged. Boris Brezillon's work was mostly focused on the memory-technology devices subsystem (and NAND controllers in particular), Al Viro made a number of fundamental changes (including parallel lookups) to the virtual filesystem layer and followed the implications of those changes through many filesystems, Linus Walleij has been reworking the GPIO subsystem, and Geert Uytterhoeven worked all over the tree, with an emphasis on various ARM-related subsystems.

In the "lines changed" column, Alex Deucher continues to work on the massive amdgpu graphics driver; Rex Zhu is also working primarily on that driver. Paul McKenney works with the read-copy-update subsystem, of course; the elevated line count this time around results from some large documentation changes. Thierry Reding works with the NVIDIA Tegra ARM subarchitecture, and Jinshan Xiong made some extensive changes to the Lustre filesystem in the staging tree.

Work in the staging tree often overshadows everything else when it comes to these lists, but, this time around, only two developers who appear in the top ten on either side were working on staging code.

There were 222 companies (that we know about) that supported work merged in the 4.7 development cycle — a fairly average figure for recent years. The most active companies this time around were:

Most active 4.7 employers

By changesets
Intel	1786	14.5%
(None)	968	7.9%
Red Hat	967	7.9%
(Unknown)	861	7.0%
Linaro	633	5.2%
SUSE	470	3.8%
IBM	378	3.1%
AMD	302	2.5%
Samsung	276	2.2%
Google	244	2.0%
Renesas Electronics	244	2.0%
NVIDIA	231	1.9%
Mellanox	227	1.8%
Free Electrons	222	1.8%
ARM	217	1.8%
Vision Engraving Systems	208	1.7%
Oracle	200	1.6%
Imagination Technologies	193	1.6%
Texas Instruments	185	1.5%
Broadcom	141	1.1%

By lines changed
Intel	86056	14.8%
AMD	69065	11.8%
(None)	35035	6.0%
Red Hat	33887	5.8%
IBM	28102	4.8%
Linaro	23396	4.0%
(Unknown)	23287	4.0%
NVIDIA	18023	3.1%
Mellanox	14011	2.4%
Samsung	12918	2.2%
SUSE	12810	2.2%
Free Electrons	12637	2.2%
QLogic	11731	2.0%
ARM	9000	1.5%
Rockchip	8938	1.5%
Renesas Electronics	8734	1.5%
Texas Instruments	7462	1.3%
(Consultant)	6964	1.2%
Chelsio	6868	1.2%
Broadcom	6564	1.1%

This table looks as it has for some time; there are no real surprises here. The percentage of changes from developers working on their own time, at 7.9%, is up from 4.6, but still remains low by historical standards. Once upon a time, volunteer developers were our primary source of new contributors to the kernel. In 4.7, of the 232 first-time contributors, 132 were known to be employed at the time, 38 were known to be working on their own time, and 62 are in the "unknown" column. Even if all the unknowns are volunteers (most of them probably are), that makes at most 100 volunteers (38 + 62) against 132 employed developers, so we still have more new contributors arriving via companies.

Contributing to the kernel used to be a fairly reliable way to get a job, and it probably still is. But, in 2016, it seems that many of our new developers get the job first, and it is the job that brings them to the kernel community.

The table above shows the changes contributed by the most active companies. One last question one might ask is: how many developers does each company have working on Linux? For the 4.7 development cycle, the answer looks like this:

# of developers/company
Company	Count	Percent
(Unknown)	238	14.5%
Intel	198	12.1%
(None)	172	10.5%
Red Hat	91	5.6%
IBM	64	3.9%
Google	48	2.9%
Linaro	43	2.6%
Mellanox	38	2.3%
SUSE	37	2.3%
AMD	30	1.8%
Samsung	27	1.6%
Huawei Technologies	27	1.6%
ARM	25	1.5%
Texas Instruments	23	1.4%
Broadcom	22	1.3%
Oracle	21	1.3%
NXP	20	1.2%
Qualcomm	17	1.0%
MediaTek	13	0.8%
Imagination Technologies	12	0.7%
Renesas Electronics	12	0.7%
Facebook	11	0.7%
NVIDIA	11	0.7%
Code Aurora Forum	10	0.6%
(Consultant)	10	0.6%
Rockchip	10	0.6%
Canonical	10	0.6%
Free Electrons	9	0.5%
Pengutronix	9	0.5%
Synopsys	8	0.5%

Intel, it seems, has far more developers working on the kernel than any other company: just over 12% of the total in 4.7. Volunteer developers may not contribute a lot of code, but there are quite a few of them; given that many (if not most) of the unknown developers probably fall into this category, developers working on their own time are still the biggest group.

The kernel community as a whole is a big group indeed, and it continues to produce kernels in a disciplined and predictable way. The relative lack of surprises may make for relatively boring statistics articles, but it is certainly welcome to users of the kernel.

Patches and updates

Kernel trees

Levin, Alexander Linux 4.1.29

Levin, Alexander Linux 3.18.38

Steven Rostedt 3.12.62-rt83

Core kernel code

Dave Hansen [v6] System Calls for Memory Protection Keys

Lina Iyer PM: SoC idle support using PM domains

Development tools

Hari Bathini perf/tracefs: Container-aware tracing support

Josh Triplett [Ksummit-discuss] [ANNOUNCE] git-series: track changes to a patch series over time

Device drivers

Amir Levy thunderbolt: Introducing Thunderbolt(TM) networking

Chris Zhong Rockchip Type-C and DisplayPort driver

Enric Balletbo i Serra Add support for cros-ec-sensors

Kamal Dasu Broadcom stb, nsp, ns2, cygnus QSPI driver

Peter Griffin [PATCH v7 00/16] Add support for FDMA DMA controller and slim core rproc found on STi chipsets

Songjun Wu [media] atmel-isc: add driver for Atmel ISC

Anurup M [RFC PATCH v1 00/10] arm64:perf: Support for Hisilicon SoC Hardware event counters

Device driver infrastructure

Mitchel Humpherys Add support for privileged mappings

Noralf Trønnes drm: Add DRM text mode

Peter Chen power: add power sequence library

Filesystems and block I/O

Seth Forshee Support for posix acls in fuse

Memory management

Michal Hocko fortify oom killer even more

Networking

Stefan Hajnoczi Add virtio transport for AF_VSOCK

Tom Herbert strp: Stream parser for messages

Security-related

Dan Jurgens SELinux support for Infiniband RDMA

Elena Reshetova [PATCH 0/5] Hardchroot LSM + additional hooks

Page editor: Jonathan Corbet