Kernel development

Brief items

Kernel release status

The 4.8 merge window is still open; see the separate article below for a list of what has been merged in the last week.

Stable updates: none have been released in the last week.

Quotes of the week

I've been saying this from the start: we can't make use of all the capabilities of pmem [persistent memory] with existing filesystems and DAX. DAX is supposed to be a *stopgap measure* until pmem native solutions are built and mature. Finding limitations like the above only serve to highlight the fact DAX on ext4/XFS is only a partial solution.

The real problem is, as always, a lack of resources to implement everything we want to be able to do. Building a new filesystem is hard, takes a long time, and all the people we have that might be able to do it are fully occupied by maintaining and enhancing the existing Linux filesystems to support things like DAX or other functionality that users want (e.g. rmap, reflink, copy offload, etc).

Dave Chinner

ABIs increase the utility of the kernel.
Ingo Molnar

File permissions in the kernel

There are many ways to make a poor impression in the kernel community; Baole Ni surely stumbled across one of them: post 1,285 separate cleanup patches, each with the same subject line, and each copied to a long list of developers. It was, David Miller said, "one of the worst patch series submissions in history." In theory, the objective of the patch was reasonable: replace hard-coded constants with their symbolic equivalents. But, it seems, this is a case where the community would rather see the numbers directly.

The change in question relates to places in the kernel where file permissions are specified, usually permissions for files to be created in sysfs or /proc. There is a set of macros defined in <linux/stat.h> for these permission bits, but it is common practice in the kernel (and among users of Unix-like systems in general) to just use their octal equivalent instead. Thus, for example, one will often see 0444 instead of S_IRUGO. Indeed, it seems one will see it at least 1,285 times, given the length of the patch set sent to eliminate octal permissions from the kernel.

There were obviously lots of complaints about how the patch set was done, but there was also a lot of opposition to the change itself. It seems that many people find a constant like 0644 easier to read than S_IWUSR|S_IRUGO. In the end, Linus made that approach official, saying that he did not want to see any of the cleanup patches merged and that, in fact, it would be better to convert users of the macros to octal constants instead.

Octal constants are not perfect either; as Al Viro pointed out, they are subject to subtle and hard-to-see errors. Perhaps, it was suggested, the real problem is that the (POSIX-defined) S_* macros are hard to read, obscuring the developer's intent rather than clarifying it. As an alternative, Ingo Molnar has proposed the adoption of a new set of macros, defined like this:

    #define PERM_rw_______	0600
    #define PERM_rw_r_____	0640
    #define PERM_rw_r__r__	0644
    #define PERM_rw_rw_r__	0664
    #define PERM_rw_rw_rw_	0666

All of the "useful" combinations have macros defined, while there are none for settings that don't make sense. Use of these macros, he said, would make the code clearer and make it harder to introduce security problems. Actually getting them merged, though, might require overcoming the habits of developers who have been typing octal constants for decades. The eventual discussion could yet end up being longer than the patch series that provoked it.

Kernel development news

4.8 Merge window part 2

By Jonathan Corbet
August 3, 2016

As of this writing, Linus has pulled 10,589 non-merge changesets into the mainline repository; that is 7,433 since last week's summary. Clearly it has been a busy week. As is often the case, much of the work being merged takes the form of internal improvements that are not immediately visible to kernel users, but a number of interesting features have found their way in as well.

Some of the more significant user-visible features include:

  • The arm64 architecture has gained support for the kexec mechanism (allowing one kernel to boot directly into another) and kernel probes.

  • The TCP "New Vegas" congestion-control algorithm is now supported. New Vegas is a significant update to Vegas, adding better support for data-center settings in particular. See this document for details.

  • The mac80211 ("WiFi") layer has seen some interesting congestion-control changes. Normal queuing disciplines interact poorly with the frame aggregation mechanism used by wireless protocols, leading to poor performance, so the queuing discipline code has been disabled for mac80211. Instead, the mac80211 layer is now using the CoDel fair-queuing algorithm. This should be a significant step forward for better WiFi performance on Linux.

  • The reliable datagram sockets (RDS) protocol allows the creation of datagram-oriented connections over a TCP link. In 4.8, the RDS implementation can use multiple TCP connections to support RDS routing between two hosts, greatly increasing the maximum throughput. See this changelog for some details and a discussion of how this protocol differs from multipath TCP.

  • The "express data path" (XDP) work described in this article has moved forward. In 4.8, network drivers can define a hook allowing a BPF program to be loaded; that program will run on incoming packets before they even have internal data structures set up for them. The hook can indicate that packets should be dropped, but it also has the ability to do simple rewriting and forwarding. For some types of workloads, the result can be greatly increased performance without the need for kernel bypass techniques.

  • The kernel's pseudo-random number generator has been replaced with a new implementation using the ChaCha20 stream cipher. There have also been some changes made to address scalability problems when user-space programs are consuming massive amounts of random data.

  • The memory-management subsystem's page-reclaim mechanism has been fundamentally reworked to track pages based on NUMA nodes rather than on memory zones. As Mel Gorman noted in the patch posting, zone-based reclaim was important in the days of 32-bit systems with a lot of high memory but, now that large-memory systems are mostly running 64-bit kernels, node-based reclaim is a more suitable approach. Users should see little change beyond, hopefully, better performance; see the posting for a number of benchmark results.

  • A fair amount of work has been put in toward the goal of allowing unprivileged users to mount filesystems in user namespaces. That goal still depends on a number of remaining loose ends being addressed, though, and so will not be achieved in the 4.8 development cycle.

  • The kernel has gained support for the Common Architecture Label IPv6 Security Option (CALIPSO) standard. CALIPSO can be used to attach security labels to packets, making them subject to normal (SELinux or Smack) security policies.

  • The PowerPC64 architecture now has a just-in-time compiler for BPF programs.

  • New hardware support includes:

    • Processors and systems: Artesyn MVME7100 single-board computers, R-Car V2H (R8A7792) systems-on-chip (SoCs), and Broadcom BCM23550 SoCs.

    • Audio: Analog Devices ADAU7002 Stereo PDM-to-I2S/TDM converters, Cirrus Logic CS53L30 and CS35L33 codecs, Maxim MAX9860 mono audio voice codecs, Maxim MAX98504 speaker amplifiers, and Allwinner A10 I2S audio interfaces.

    • Graphics: ARM Mali display processors, Silicon Image sii902x RGB/HDMI bridges, and Toshiba TC358767 eDP bridges.

    • Input: Atmel capacitive touch buttons, Ntrig/Microsoft Surface 3 SPI touchscreens, Raydium I2C touchscreens, Pegasus Mobile Notetaker Pen input tablets, and Alps I2C HID touchpads and StickPointers.

    • Miscellaneous: TI LP3952 2-channel LED controllers, Qualcomm Hexagon V5 peripheral image loaders, Marvell version 2 XOR engines, Xilinx ZynqMP DMA engines, R-Car R8A7796 clock pulse generators, Allwinner H3 clock-control units, Amlogic S905 clock controllers, PowerPC PowerNV PCI hotplug controllers, Aspeed 2400 watchdog timers, Maxim MAX77620 watchdog timers, Amlogic Meson GXBB SoC watchdog timers, Broadcom STB SDIO/SD/MMC host controllers, Broadcom PDC mailbox managers, Altera Arria10 DevKit system resource chips, Atmel external bus interface controllers, NVIDIA Tegra ACONNECT bus controllers, HiSilicon SPI-NOR flash controllers, MediaTek SDG1 NFC NAND controllers, Atmel Quad SPI controllers, Cadence Quad SPI controllers, and Aardvark PCIe controllers.

    • Networking: Freescale QUICC Engine HDLC controllers, Broadcom BCM53xx Ethernet switches, Broadcom Northstar2 PCIe PHYs, Intel XWAY PHYs, Renesas R-Car CAN FD controllers, HiSilicon fast Ethernet MAC controllers, and APM X-Gene SoC MDIO bus controllers.

    • Pin control: Oxford Semiconductor OXNAS SoC family pin controllers, Maxim MAX77620/MAX20024 pin controllers, UniPhier PH1-LD11 and PH1-LD20 SoC pin controllers, Intel Merrifield pin controllers, Broadcom NSP pin controllers, Qualcomm 9615 pin controllers, and STMicroelectronics STM32F746 pin controllers.

Changes visible to kernel developers include:

  • The GCC plugin infrastructure patches have been merged, making it possible to use compiler plugin modules to modify how the kernel is built. As of this writing, plugins for coverage testing and calculation of cyclomatic complexity have been merged. The "latent entropy" plugin, which tries to generate entropy early in the bootstrap process, is in a pull request but has not been pulled as of this writing.

  • The new skb_array mechanism adds an array-based FIFO data structure for the queuing of network packets; see <linux/skb_array.h> for an overview of the API.

  • The task of reworking the CPU hotplug mechanism continues with the conversion of more notifiers to the new scheme. As Thomas Gleixner put it in the pull request: "Another 700 hundred line of unpenetrable maze gone".

The 4.8 merge window still has a few days to run, so expect a few more features to land before the 4.8-rc1 release comes out. Next week's Kernel Page will, of course, contain an update with the final changes to be merged for this development cycle.

Hardened usercopy

By Jake Edge
August 3, 2016

The kernel often copies data from and to user space, which makes copy_to_user() and copy_from_user() (and friends) rather frequently used kernel functions. But if the kernel can be tricked into copying too much data in either direction, security vulnerabilities can be the result. Long ago, grsecurity added the PAX_USERCOPY feature (created by the PaX team) to harden those calls, so that even poorly written code elsewhere in the kernel cannot truly copy more than it should. Code based on PAX_USERCOPY is now being proposed for inclusion into the mainline kernel.

Kees Cook posted the first version of his "hardened usercopy" patches in early July. The patches are based on some earlier work that Casey Schaufler had done to port the PAX_USERCOPY feature from grsecurity to the mainline. Essentially, it tries to ensure that address ranges used to copy data to and from user space are valid. Cook is also working on patches for two other parts of the PAX_USERCOPY feature; this piece is configured into the kernel with the CONFIG_HARDENED_USERCOPY option.

The main problems that can result from an errant user-space copy are either that too much data is copied to user space, resulting in leaking the contents of kernel memory, or that too much data is copied from user space, which can overwrite kernel memory. If an attacker can influence the allocation of objects on the kernel's heap and then overwrite some of those objects, they may be able to escalate privileges, run arbitrary code, or crash the kernel. Information leaks are generally less dangerous, but the kernel does have critical data (e.g. keys) that could be exposed. Beyond that, determining the layout of kernel memory by way of an information leak can also provide information needed to exploit other kernel flaws.

The patches add several tests of the arguments to the copy_*_user() functions, which have the following prototypes:

    long copy_from_user(void *to, const void __user *from, unsigned long n);
    long copy_to_user(void __user *to, const void *from, unsigned long n);

Each call involves a user-space pointer and a kernel-space pointer; the user-space pointers are already checked in current kernels, so the patches only add tests for the kernel-space pointers. Those tests ensure that the address range doesn't wrap past the end of memory, that the kernel-space pointer is not null, and that it does not point to a zero-length kmalloc() allocation (i.e. ZERO_OR_NULL_PTR() is false). Also, if the address range overlaps the kernel text (code) segment, it is rejected.

Beyond that, if the kernel-space address points into an object that has been allocated from the slab allocator, the patches ensure that what is being copied fits within the size of the object allocated. This check is performed by calling PageSlab() on the kernel address to see if it lies within a page that is handled by the slab allocator; it then calls an allocator-specific routine to determine whether the amount of data to be copied is fully within an allocated object. If the address range is not handled by the slab allocator, the patches will test that it is either within a single or compound page and that it does not span independently allocated pages.

In addition, for copies involving the stack, the copied range must fit within the current process's stack. If there is architecture support for identifying stack frames, the copied range must fit within a single frame.

In all cases, an address range that fails the tests will generate a log message with the pertinent information. It will also call BUG() to generate a kernel oops and kill the current process (i.e. the one that was trying to exploit a kernel hole of some kind).

The patch set is broken up into three logical chunks: the main patch that adds the tests, patches that enable the feature for specific architectures (originally, x86, arm, arm64, ia64, powerpc, and sparc, with s390 added in a more recent patch set), and two patches that add heap-checking support for the SLAB and SLUB allocators. Cook noted that the SLOB allocator support in grsecurity "seems entirely broken", so he focused on testing SLAB and SLUB. In addition, stack frame checking has only been implemented for x86.

Cook said that he "couldn't detect a measurable performance change with these features enabled", when running tests like kernel builds and hackbench. That suggested that the feature could be turned on by default at some point, though it is turned off by default for now. Ingo Molnar suggested running a system-call-heavy workload to see whether that caused any measurable performance degradation, as he would also like to see the feature on by default. Linus Torvalds said that a stat()-heavy workload (e.g. something like git diff) would be one way to test it, but indicated that he thought the checks would not be all that onerous.

Andy Lutomirski wondered if some of the infrastructure to validate the objects being copied should be given a different name, since it might be extended to more than just "usercopy" down the road. That set off a bit of a squabble between Molnar and PaX Team about the feature, threat models, and "bikeshedding". Cook, however, successfully tamped down the flickering flames:

There's a long history of misunderstanding and miscommunication (intentional or otherwise) by everyone on these topics. I'd love it if we can just side-step all of it, and try to stick as closely to the technical discussions as possible. Everyone involved in these discussions wants better security, even if we go about it in different ways. If anyone finds themselves feeling insulted, just try to let it go, and focus on the places where we can find productive common ground, remembering that any fighting just distracts from the more important issues at hand.

The patch set is in its fourth revision at this point; Cook has requested that it be pulled for 4.8. In the review process, some bugs have been fixed (notably some arm64 fixes and additions from Laura Abbott) and changes made, but no fundamental disagreement with the feature has emerged. As of this writing, the patches have not been pulled, but there were some prerequisites, so it may simply be that Torvalds hasn't gotten to it yet. But, if not for 4.8, it seems likely that we will see the feature appear in the mainline fairly soon.

Statistics from the 4.7 development cycle

By Jonathan Corbet
August 2, 2016

The 4.7 kernel was released on July 24, so longtime readers might be wondering where the usual development statistics are. We're running a little late this time around, but for good reason: Greg Kroah-Hartman obtained information from a large number of developers on who they work for, and we're now able to use that information to produce better numbers. Of course, the overall story hasn't changed a whole lot (kernel development is relatively boring and predictable these days), but each cycle still has a few noteworthy points.

The 4.7 development cycle saw the merging of 12,283 changesets from 1,582 developers; 232 of those developers appeared in the kernel changelog for the first time. Those changes added just under 300,000 lines to the kernel source and 740 new files to the kernel tree. Of those developers, the most active were:

Most active 4.7 developers

By changesets

Developer                 Changesets  Percent
H Hartley Sweeten                208     1.7%
Boris Brezillon                  132     1.1%
Al Viro                          127     1.0%
Linus Walleij                    121     1.0%
Geert Uytterhoeven               120     1.0%
Arnaldo Carvalho de Melo         110     0.9%
Ville Syrjälä                    105     0.9%
Laxman Dewangan                  101     0.8%
Arnd Bergmann                     97     0.8%
Jes Sorensen                      97     0.8%
Eric Dumazet                      91     0.7%
Dan Carpenter                     88     0.7%
Aneesh Kumar K.V                  79     0.6%
Michal Hocko                      74     0.6%
Chris Wilson                      71     0.6%
Wolfram Sang                      68     0.6%
Florian Westphal                  66     0.5%
James Hogan                       66     0.5%
Daniel Vetter                     64     0.5%
Imre Deak                         62     0.5%

By changed lines

Developer                      Lines  Percent
Alex Deucher                   37185     6.4%
Rex Zhu                        19912     3.4%
Paul E. McKenney               14004     2.4%
Thierry Reding                  9170     1.6%
Jinshan Xiong                   8828     1.5%
Yuval Mintz                     8419     1.4%
Jes Sorensen                    6982     1.2%
Chanwoo Choi                    5742     1.0%
H Hartley Sweeten               5705     1.0%
Varun Prakash                   5703     1.0%
Boris Brezillon                 5347     0.9%
Aneesh Kumar K.V                5230     0.9%
Tom Zanussi                     5116     0.9%
CK Hu                           5072     0.9%
Ilya Dryomov                    4764     0.8%
Linus Walleij                   4738     0.8%
Maxime Ripard                   4631     0.8%
Mathieu Poirier                 4559     0.8%
Christoph Hellwig               4232     0.7%
Finn Thain                      4024     0.7%

By this point it should come as no surprise that H Hartley Sweeten made it to the top of the "by changesets" list with continued work on the Comedi drivers in the staging tree; nearly 8,400 patches have gone into that subsystem since it was merged. Boris Brezillon's work was mostly focused on the memory-technology devices subsystem (and NAND controllers in particular), Al Viro made a number of fundamental changes (including parallel lookups) to the virtual filesystem layer and followed the implications of those changes through many filesystems, Linus Walleij has been reworking the GPIO subsystem, and Geert Uytterhoeven worked all over the tree, with an emphasis on various ARM-related subsystems.

In the "lines changed" column, Alex Deucher continues to work on the massive amdgpu graphics driver; Rex Zhu is also working primarily on that driver. Paul McKenney works with the read-copy-update subsystem, of course; the elevated line count this time around results from some large documentation changes. Thierry Reding works with the NVIDIA Tegra ARM subarchitecture, and Jinshan Xiong made some extensive changes to the Lustre filesystem in the staging tree.

Work in the staging tree often overshadows everything else when it comes to these lists but, this time around, only two developers who appear in the top ten on either side were working on staging code.

There were 222 companies (that we know about) that supported work merged in the 4.7 development cycle — a fairly average figure for recent years. The most active companies this time around were:

Most active 4.7 employers

By changesets

Employer                  Changesets  Percent
Intel                           1786    14.5%
(None)                           968     7.9%
Red Hat                          967     7.9%
(Unknown)                        861     7.0%
Linaro                           633     5.2%
SUSE                             470     3.8%
IBM                              378     3.1%
AMD                              302     2.5%
Samsung                          276     2.2%
Google                           244     2.0%
Renesas Electronics              244     2.0%
NVIDIA                           231     1.9%
Mellanox                         227     1.8%
Free Electrons                   222     1.8%
ARM                              217     1.8%
Vision Engraving Systems         208     1.7%
Oracle                           200     1.6%
Imagination Technologies         193     1.6%
Texas Instruments                185     1.5%
Broadcom                         141     1.1%

By lines changed

Employer                       Lines  Percent
Intel                          86056    14.8%
AMD                            69065    11.8%
(None)                         35035     6.0%
Red Hat                        33887     5.8%
IBM                            28102     4.8%
Linaro                         23396     4.0%
(Unknown)                      23287     4.0%
NVIDIA                         18023     3.1%
Mellanox                       14011     2.4%
Samsung                        12918     2.2%
SUSE                           12810     2.2%
Free Electrons                 12637     2.2%
QLogic                         11731     2.0%
ARM                             9000     1.5%
Rockchip                        8938     1.5%
Renesas Electronics             8734     1.5%
Texas Instruments               7462     1.3%
(Consultant)                    6964     1.2%
Chelsio                         6868     1.2%
Broadcom                        6564     1.1%

This table looks as it has for some time; there are no real surprises here. The percentage of changes from developers working on their own time, at 7.9%, is up from the 4.6 cycle, but still remains low by historical standards. Once upon a time, volunteer developers were our primary source of new contributors to the kernel. In 4.7, of the 232 first-time contributors, 132 were known to be employed at the time, 38 were known to be working on their own time, and 62 are in the "unknown" column. Even if all the unknowns are volunteers (most of them probably are), we still have more new contributors arriving via companies.

Contributing to the kernel used to be a fairly reliable way to get a job, and it probably still is. But, in 2016, it seems that many of our new developers get the job first, and it is the job that brings them to the kernel community.

The table above shows the changes contributed by the most active companies. One last question one might ask is: how many developers does each company have working on Linux? For the 4.7 development cycle, the answer looks like this:

Number of developers per company

Company                        Count  Percent
(Unknown)                        238    14.5%
Intel                            198    12.1%
(None)                           172    10.5%
Red Hat                           91     5.6%
IBM                               64     3.9%
Google                            48     2.9%
Linaro                            43     2.6%
Mellanox                          38     2.3%
SUSE                              37     2.3%
AMD                               30     1.8%
Samsung                           27     1.6%
Huawei Technologies               27     1.6%
ARM                               25     1.5%
Texas Instruments                 23     1.4%
Broadcom                          22     1.3%
Oracle                            21     1.3%
NXP                               20     1.2%
Qualcomm                          17     1.0%
MediaTek                          13     0.8%
Imagination Technologies          12     0.7%
Renesas Electronics               12     0.7%
Facebook                          11     0.7%
NVIDIA                            11     0.7%
Code Aurora Forum                 10     0.6%
(Consultant)                      10     0.6%
Rockchip                          10     0.6%
Canonical                         10     0.6%
Free Electrons                     9     0.5%
Pengutronix                        9     0.5%
Synopsys                           8     0.5%

Intel, it seems, has far more developers working on the kernel than any other company: just over 12% of the total in 4.7. Volunteer developers may not contribute a lot of code, but there are quite a few of them; given that many (if not most) of the unknown developers probably fall into this category, developers working on their own time are still the biggest group.

The kernel community as a whole is a big group indeed, and it continues to produce kernels in a disciplined and predictable way. The relative lack of surprises may make for relatively boring statistics articles, but it is certainly welcome to users of the kernel.

Patches and updates

Kernel trees

Levin, Alexander Linux 4.1.29
Levin, Alexander Linux 3.18.38
Steven Rostedt 3.12.62-rt83

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds