Kernel development
Brief items
Kernel release status
The 4.8 merge window is still open; see the separate article below for a list of what has been merged in the last week.
Stable updates: none have been released in the last week.
Quotes of the week
The real problem is, as always, a lack of resources to implement everything we want to be able to do. Building a new filesystem is hard, takes a long time, and all the people we have that might be able to do it are fully occupied by maintaining and enhancing the existing Linux filesystems to support things like DAX or other functionality that users want (e.g. rmap, reflink, copy offload, etc).
File permissions in the kernel
There are many ways to make a poor impression in the kernel community; Baole Ni surely stumbled across one of them: post 1,285 separate cleanup patches, each with the same subject line, and each copied to a long list of developers. It was, David Miller said, "one of the worst patch series submissions in history." In theory, the objective of the patch was reasonable: replace hard-coded constants with their symbolic equivalents. But, it seems, this is a case where the community would rather see the numbers directly.
The change in question relates to places in the kernel where file permissions are specified — usually permissions for files to be created in sysfs or /proc. There is a set of macros defined in <linux/stat.h> for these permissions bits, but it is common practice in the kernel — and among users of Unix-like systems in general — to just use their octal equivalent instead. Thus, for example, one will often see 0444 instead of S_IRUGO. Indeed, it seems one will see it at least 1,285 times, given the length of the patch set sent to eliminate octal permissions from the kernel.
There were, unsurprisingly, many complaints about how the patch set was submitted, but there was also a lot of opposition to the change itself. It seems that many people find a constant like 0644 easier to read than S_IWUSR|S_IRUGO. In the end, Linus made that preference official, saying that he did not want to see any of the cleanup patches merged and that, in fact, it would be better to convert users of the macros to octal constants instead.
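To make the comparison concrete, here is a minimal user-space sketch showing that the symbolic and octal spellings denote the same bits. S_IRUGO is a kernel-side convenience that user-space <sys/stat.h> does not normally provide, so it is defined here by hand on the assumption that it combines the three read bits; the two function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Assumption: S_IRUGO is "read for user, group, and other". User-space
 * headers normally lack it, so define it by hand for this sketch.
 */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* Compile-time checks that the two spellings agree. */
static_assert(S_IRUGO == 0444, "S_IRUGO should equal 0444");
static_assert((S_IWUSR | S_IRUGO) == 0644, "S_IWUSR|S_IRUGO should equal 0644");

/* Hypothetical helpers: the same mode, spelled both ways. */
unsigned int attr_mode_symbolic(void)
{
	return S_IWUSR | S_IRUGO;
}

unsigned int attr_mode_octal(void)
{
	return 0644;
}
```

Both functions return the same value; the disagreement in the discussion was about which spelling a reader can check at a glance, not about what the bits mean.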
Octal constants are not perfect either; as Al Viro pointed out, they are subject to subtle and hard-to-see errors. Perhaps, it was suggested, the real problem is that the (POSIX-defined) S_* macros are hard to read, obscuring the developer's intent rather than clarifying it. As an alternative, Ingo Molnar has proposed the adoption of a new set of macros, defined like this:
#define PERM_rw_______ 0600
#define PERM_rw_r_____ 0640
#define PERM_rw_r__r__ 0644
#define PERM_rw_rw_r__ 0664
#define PERM_rw_rw_rw_ 0666
All of the "useful" combinations have macros defined, while there are none for settings that don't make sense. Use of these macros, he said, would make the code clearer and make it harder to introduce security problems. Actually getting them merged, though, might require overcoming the habits of developers who have been typing octal constants for decades. The eventual discussion could yet end up being longer than the patch series that provoked it.
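As a rough illustration of the readability argument, the sketch below renders a mode as the familiar nine-character string, so that a macro name like PERM_rw_r__r__ can be compared with what its value means. The two PERM_ definitions are copied from the list above; perm_to_string() and perm_renders_as() are hypothetical helpers that are not part of any proposal. The final check shows one well-known octal slip, a dropped leading zero, which may or may not be the kind of error Viro had in mind.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two of the proposed macros, copied from the list in the text. */
#define PERM_rw_______ 0600
#define PERM_rw_r__r__ 0644

/* Hypothetical helper: render the low nine permission bits as text. */
void perm_to_string(unsigned int mode, char out[10])
{
	static const char flags[] = "rwxrwxrwx";
	int i;

	for (i = 0; i < 9; i++)
		out[i] = (mode & (0400u >> i)) ? flags[i] : '-';
	out[9] = '\0';
}

/* Hypothetical helper: does the mode render as the expected string? */
bool perm_renders_as(unsigned int mode, const char *expected)
{
	char buf[10];

	perm_to_string(mode, buf);
	return strcmp(buf, expected) == 0;
}

/* Without the leading zero, 644 is decimal, which is 01204 in octal. */
static_assert(644 == 01204, "decimal 644 is octal 01204, not 0644");
```

With these definitions, PERM_rw_r__r__ renders as "rw-r--r--", while the mistyped decimal 644 renders as "-w----r--" in its low nine bits, a rather different set of permissions.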
Kernel development news
4.8 Merge window part 2
As of this writing, Linus has pulled 10,589 non-merge changesets into the mainline repository; that is 7,433 since last week's summary. Clearly it has been a busy week. As is often the case, much of the work being merged takes the form of internal improvements that are not immediately visible to kernel users, but a number of interesting features have found their way in as well.
Some of the more significant user-visible features include:
- The arm64 architecture has gained support for the kexec mechanism
(allowing one kernel to boot directly into another) and kernel
probes.
- The TCP "New Vegas" congestion-control algorithm is now supported.
New Vegas is a significant update to Vegas, adding better support for
data-center settings in particular. See this
document for details.
- The mac80211 ("WiFi") layer has seen some interesting
congestion-control changes. Normal queuing disciplines interact
poorly with the frame
aggregation mechanism used by wireless protocols, leading to poor
performance, so the queuing discipline code has been disabled for
mac80211. Instead, the mac80211 layer
is now using the CoDel fair-queuing
algorithm. This should be a significant step forward for better WiFi
performance on Linux.
- The reliable
datagram sockets (RDS) protocol allows the creation of
datagram-oriented connections over a TCP link. In 4.8, the RDS
implementation can use multiple TCP connections to support RDS routing
between two hosts, greatly increasing the maximum throughput. See this
changelog for some details and a discussion of how this protocol
differs from multipath TCP.
- The "express data path" (XDP) work described in this article has moved forward. In 4.8,
network drivers can define a hook allowing a BPF program to be loaded;
that program will run on incoming packets before they even have
internal data structures set up for them. The hook can indicate that
packets should be dropped, but it also has the ability to do simple
rewriting and forwarding. For some types of workloads, the result can
be greatly increased performance without the need for kernel bypass
techniques.
- The kernel's pseudo-random number generator has been replaced with a new
implementation using the ChaCha20
stream cipher. There have also been some changes made to address
scalability problems when user-space programs are consuming massive
amounts of random data.
- The memory-management subsystem's page-reclaim mechanism has been
fundamentally reworked to track pages based on NUMA nodes rather than
on memory zones. As Mel Gorman noted in the patch posting, zone-based reclaim was
important in the days of 32-bit systems with a lot of high memory
but, now that large-memory systems are mostly running 64-bit kernels,
node-based reclaim is a more suitable approach. Users should see
little change beyond, hopefully, better performance; see the posting
for a number of benchmark results.
- A fair amount of work has been put in toward the goal of allowing
unprivileged users to mount filesystems in user namespaces. That goal
still depends on a number of remaining loose ends being addressed,
though, and so will not be achieved in the 4.8 development cycle.
- The kernel has gained support for the Common Architecture Label
IPv6 Security Option (CALIPSO) standard. CALIPSO can be used to
attach security labels to packets, making them subject to normal
(SELinux or Smack) security policies.
- The PowerPC64 architecture now has a just-in-time compiler for BPF
programs.
- New hardware support includes:
- Processors and systems:
Artesyn MVME7100 single-board computers,
R-Car V2H (R8A7792) systems-on-chip (SoCs), and
Broadcom BCM23550 SoCs.
- Audio:
Analog Devices ADAU7002 Stereo PDM-to-I2S/TDM converters,
Cirrus Logic CS53L30 and CS35L33 codecs,
Maxim MAX9860 mono audio voice codecs,
Maxim MAX98504 speaker amplifiers, and
Allwinner A10 I2S audio interfaces.
- Graphics:
ARM Mali display processors,
Silicon Image sii902x RGB/HDMI bridges, and
Toshiba TC358767 eDP bridges.
- Input:
Atmel capacitive touch buttons,
Ntrig/Microsoft Surface 3 SPI touchscreens,
Raydium I2C touchscreens,
Pegasus Mobile Notetaker Pen input tablets, and
Alps I2C HID touchpads and StickPointers.
- Miscellaneous:
TI LP3952 2-channel LED controllers,
Qualcomm Hexagon V5 peripheral image loaders,
Marvell version 2 XOR engines,
Xilinx ZynqMP DMA engines,
R-Car R8A7796 clock pulse generators,
Allwinner H3 clock-control units,
Amlogic S905 clock controllers,
PowerPC PowerNV PCI hotplug controllers,
Aspeed 2400 watchdog timers,
Maxim Max77620 watchdog timers,
Amlogic Meson GXBB SoC watchdog timers,
Broadcom STB SDIO/SD/MMC host controllers,
Broadcom PDC mailbox managers,
Altera Arria10 DevKit system resource chips,
Atmel external bus interface controllers,
NVIDIA Tegra ACONNECT bus controllers,
HiSilicon SPI-NOR flash controllers,
MediaTek SDG1 NFC NAND controllers,
Atmel Quad SPI controllers,
Cadence Quad SPI controllers, and
Aardvark PCIe controllers.
- Networking:
Freescale QUICC Engine HDLC controllers,
Broadcom BCM53xx Ethernet switches,
Broadcom Northstar2 PCIe PHYs,
Intel XWAY PHYs,
Renesas R-Car CAN FD controllers,
HiSilicon fast Ethernet MAC controllers, and
APM X-Gene SoC MDIO bus controllers.
- Pin control: Oxford Semiconductor OXNAS SoC family pin controllers, Maxim MAX77620/MAX20024 pin controllers, UniPhier PH1-LD11 and PH1-LD20 SoC pin controllers, Intel Merrifield pin controllers, Broadcom NSP pin controllers, Qualcomm 9615 pin controllers, and STMicroelectronics STM32F746 pin controllers.
Changes visible to kernel developers include:
- The GCC plugin infrastructure patches
have been merged, making it possible to use compiler plugin modules
to modify how the kernel is built. As of this writing,
plugins for coverage testing and calculation of cyclomatic complexity
have been merged. The "latent entropy" plugin, which tries to
generate entropy early in the bootstrap process, is in a pull request
but has not been pulled as of this writing.
- The new skb_array mechanism adds an array-based FIFO data
structure for the queuing of network packets; see <linux/skb_array.h> for an
overview of the API.
- The task of reworking the CPU hotplug
mechanism continues with the conversion of more notifiers to the
new scheme. As Thomas Gleixner put it in the
pull request: "Another 700 hundred line of unpenetrable maze gone".
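For readers unfamiliar with the idea behind the skb_array item above, the following is a generic user-space sketch of an array-based FIFO of pointers. It is not the skb_array API (see <linux/skb_array.h> for that); the struct, the function names, and the fixed RING_SIZE are all assumptions made for illustration, and the sketch does no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4	/* hypothetical fixed capacity */

/* Hypothetical array-backed FIFO of opaque pointers. */
struct ptr_fifo {
	void *slots[RING_SIZE];
	unsigned int head;	/* next slot to consume */
	unsigned int tail;	/* next slot to fill */
	unsigned int count;	/* number of queued items */
};

/* Queue an item at the tail; returns false if the array is full. */
bool fifo_produce(struct ptr_fifo *f, void *item)
{
	assert(item != NULL);	/* NULL is reserved to mean "empty" */
	if (f->count == RING_SIZE)
		return false;
	f->slots[f->tail] = item;
	f->tail = (f->tail + 1) % RING_SIZE;
	f->count++;
	return true;
}

/* Remove the oldest item; returns NULL if the FIFO is empty. */
void *fifo_consume(struct ptr_fifo *f)
{
	void *item;

	if (f->count == 0)
		return NULL;
	item = f->slots[f->head];
	f->head = (f->head + 1) % RING_SIZE;
	f->count--;
	return item;
}
```

Because the storage is a fixed array, producing and consuming take constant time and need no per-item allocation, which is the general attraction of this kind of structure for packet queuing.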
The 4.8 merge window still has a few days to run, so expect a few more features to land before the 4.8-rc1 release comes out. Next week's Kernel Page will, of course, contain an update with the final changes to be merged for this development cycle.
Hardened usercopy
The kernel often copies data from and to user space, which makes copy_to_user() and copy_from_user() (and friends) rather frequently used kernel functions. But if the kernel can be tricked into copying too much data in either direction, security vulnerabilities can be the result. Long ago, grsecurity added the PAX_USERCOPY feature (created by the PaX team) to harden those calls, so that even poorly written code elsewhere in the kernel cannot truly copy more than it should. Code based on PAX_USERCOPY is now being proposed for inclusion into the mainline kernel.
Kees Cook posted the first version of his "hardened usercopy" patches in early July. The patches are based on some earlier work that Casey Schaufler had done to port the PAX_USERCOPY feature from grsecurity to the mainline. Essentially, it tries to ensure that address ranges used to copy data to and from user space are valid. Cook is also working on patches for two other parts of the PAX_USERCOPY feature; this piece is configured into the kernel with the CONFIG_HARDENED_USERCOPY option.
The main problems that can result from an errant user-space copy are either that too much data is copied to user space, resulting in leaking the contents of kernel memory, or that too much data is copied from user space, which can overwrite kernel memory. If an attacker can influence the allocation of objects on the kernel's heap and then overwrite some of those objects, they may be able to escalate privileges, run arbitrary code, or crash the kernel. Information leaks are generally less dangerous, but the kernel does have critical data (e.g. keys) that could be exposed. Beyond that, determining the layout of kernel memory by way of an information leak can also provide information needed to exploit other kernel flaws.
The patches add several tests of the arguments to the copy_*_user() functions, which have the following prototypes:
long copy_from_user(void *to, const void __user * from, unsigned long n);
long copy_to_user(void __user *to, const void *from, unsigned long n);
Each call involves a user-space pointer and a kernel-space pointer; the user-space pointers are already checked in current kernels, so the patches only add tests for the kernel-space pointers. Those tests ensure that the address range doesn't wrap past the end of memory, that the kernel-space pointer is not null, and that it does not point to a zero-length kmalloc() allocation (i.e. ZERO_OR_NULL_PTR() is false). Also, if the address range overlaps the kernel text (code) segment, it is rejected.
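The basic checks on the kernel-space pointer can be modeled in user space roughly as follows. This is a sketch of the logic described above, not the code from the patch set: kernel_range_ok() and its parameters are hypothetical, addresses are passed as plain integers, and ZERO_SIZE_ADDR assumes that the kernel marks zero-length allocations with a small non-null address (16) so that one comparison covers both the null and zero-length cases.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumption: zero-length allocations are marked with this address. */
#define ZERO_SIZE_ADDR 16UL

/*
 * Hypothetical model of the basic checks on a kernel-space range
 * [ptr, ptr + n). The text segment is [text_start, text_end).
 */
bool kernel_range_ok(unsigned long ptr, unsigned long n,
		     unsigned long text_start, unsigned long text_end)
{
	unsigned long end;

	assert(text_start <= text_end);

	/* Reject null pointers and the zero-length allocation marker. */
	if (ptr <= ZERO_SIZE_ADDR)
		return false;

	/* Reject ranges that would wrap past the end of the address space. */
	if (n > ULONG_MAX - ptr)
		return false;
	end = ptr + n;

	/* Reject any overlap with the kernel text segment. */
	if (ptr < text_end && end > text_start)
		return false;

	return true;
}
```

A range ending exactly at text_start is accepted, since the end of the range is exclusive; one that reaches even a byte into the text segment is not.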
Beyond that, if the kernel-space address points into an object that has been allocated from the slab allocator, the patches ensure that what is being copied fits within the size of the object allocated. This check is performed by calling PageSlab() on the kernel address to see if it lies within a page that is handled by the slab allocator; it then calls an allocator-specific routine to determine whether the amount of data to be copied is fully within an allocated object. If the address range is not handled by the slab allocator, the patches will test that it is either within a single or compound page and that it does not span independently allocated pages.
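The object-size test can be pictured with a simplified model in which a slab is an array of equally sized objects. The real SLAB and SLUB routines work with their own metadata; the struct slab_model type and copy_fits_in_object() below are hypothetical and only show the arithmetic of "does this copy stay inside one allocated object".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: nr_objects objects of object_size bytes at base. */
struct slab_model {
	unsigned long base;
	unsigned long object_size;
	unsigned long nr_objects;
};

/* Does the copy of n bytes starting at ptr stay within one object? */
bool copy_fits_in_object(const struct slab_model *s,
			 unsigned long ptr, unsigned long n)
{
	unsigned long offset, within;

	assert(s->object_size > 0);

	if (ptr < s->base)
		return false;
	offset = ptr - s->base;

	/* The address must land inside one of the slab's objects. */
	if (offset / s->object_size >= s->nr_objects)
		return false;

	/* The copy may use only the room left in that object. */
	within = offset % s->object_size;
	return n <= s->object_size - within;
}
```

A copy that starts partway into an object is allowed only as much length as remains before the next object begins, which is what stops an oversized copy from spilling into a neighboring allocation.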
In addition, for copies involving the stack, the copied range must fit within the current process's stack. If there is architecture support for identifying stack frames, the copied range must fit within a single frame.
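The stack test can be sketched in the same style. The names below (enum stack_check, check_stack_range(), range_in_frame()) are hypothetical, the stack and frame bounds are passed in as plain integers, and the sketch assumes that the wrap-around check on the range has already passed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts for a copy range relative to the stack. */
enum stack_check {
	STACK_NOT_INVOLVED,	/* range lies entirely outside the stack */
	STACK_OK,		/* range lies entirely within the stack */
	STACK_BAD		/* range straddles a stack boundary */
};

/* Classify [ptr, ptr + n) against the stack [low, high). */
enum stack_check check_stack_range(unsigned long ptr, unsigned long n,
				   unsigned long low, unsigned long high)
{
	unsigned long end = ptr + n;	/* assumes no wrap-around */

	assert(low <= high);

	if (end <= low || ptr >= high)
		return STACK_NOT_INVOLVED;
	if (ptr >= low && end <= high)
		return STACK_OK;
	return STACK_BAD;
}

/* Where frames can be identified: does the range fit in one frame? */
bool range_in_frame(unsigned long ptr, unsigned long n,
		    unsigned long frame_low, unsigned long frame_high)
{
	return ptr >= frame_low && ptr + n <= frame_high;
}
```

The frame-level test is the stricter of the two: a range can sit comfortably inside the stack and still run from one frame into the next.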
In all cases, an address range that fails the tests will generate a log message with the pertinent information. It will also call BUG() to generate a kernel oops and kill the current process (i.e. the one that was trying to exploit a kernel hole of some kind).
The patch set is broken up into three logical chunks: the main patch that adds the tests, patches that enable the feature for specific architectures (originally, x86, arm, arm64, ia64, powerpc, and sparc, with s390 added in a more recent patch set), and two patches that add heap-checking support for the SLAB and SLUB allocators. Cook noted that the SLOB allocator support in grsecurity "seems entirely broken", so he focused on testing SLAB and SLUB. In addition, stack frame checking has only been implemented for x86.
Cook said that he "couldn't detect a measurable performance change with these features enabled", when running tests like kernel builds and hackbench. That suggested that the feature could be turned on by default at some point, though it is turned off by default for now. Ingo Molnar suggested running a system-call-heavy workload to see if that had any measurable performance degradation, as he would also like to see the feature on by default. Linus Torvalds said that a stat()-heavy workload (e.g. something like git diff) would be one way to test it, but indicated that he thought the checks would not be all that onerous.
Andy Lutomirski wondered if some of the infrastructure to validate the objects being copied should be given a different name, since it might be extended to more than just "usercopy" down the road. That set off a bit of a squabble between Molnar and PaX Team about the feature, threat models, and "bikeshedding". Cook, however, successfully tamped down the flickering flames:
The patch set is in its fourth revision at this point; Cook has requested that it be pulled for 4.8. In the review process, some bugs have been fixed (notably some arm64 fixes and additions from Laura Abbott) and changes made, but no fundamental disagreement with the feature has emerged. As of this writing, the patches have not been pulled, but there were some prerequisites so it may simply be that Torvalds just hasn't gotten to it yet. But, if not for 4.8, it seems likely that we will see the feature appear in the mainline fairly soon.
Statistics from the 4.7 development cycle
The 4.7 kernel was released on July 24, so longtime readers might be wondering where the usual development statistics are. We're running a little late this time around, but for good reason: Greg Kroah-Hartman obtained information from a large number of developers on who they work for, and we're now able to use that information to produce better numbers. Of course, the overall story hasn't changed a whole lot (kernel development is relatively boring and predictable these days), but each cycle still has a few noteworthy points.
The 4.7 development cycle saw the merging of 12,283 changesets from 1,582 developers; 232 of those developers appeared in the kernel changelog for the first time. Those changes added just under 300,000 lines to the kernel source and 740 new files to the kernel tree. Of those developers, the most active were:
Most active 4.7 developers
By changesets
H Hartley Sweeten              208    1.7%
Boris Brezillon                132    1.1%
Al Viro                        127    1.0%
Linus Walleij                  121    1.0%
Geert Uytterhoeven             120    1.0%
Arnaldo Carvalho de Melo       110    0.9%
Ville Syrjälä                  105    0.9%
Laxman Dewangan                101    0.8%
Arnd Bergmann                   97    0.8%
Jes Sorensen                    97    0.8%
Eric Dumazet                    91    0.7%
Dan Carpenter                   88    0.7%
Aneesh Kumar K.V                79    0.6%
Michal Hocko                    74    0.6%
Chris Wilson                    71    0.6%
Wolfram Sang                    68    0.6%
Florian Westphal                66    0.5%
James Hogan                     66    0.5%
Daniel Vetter                   64    0.5%
Imre Deak                       62    0.5%
By changed lines
Alex Deucher                 37185    6.4%
Rex Zhu                      19912    3.4%
Paul E. McKenney             14004    2.4%
Thierry Reding                9170    1.6%
Jinshan Xiong                 8828    1.5%
Yuval Mintz                   8419    1.4%
Jes Sorensen                  6982    1.2%
Chanwoo Choi                  5742    1.0%
H Hartley Sweeten             5705    1.0%
Varun Prakash                 5703    1.0%
Boris Brezillon               5347    0.9%
Aneesh Kumar K.V              5230    0.9%
Tom Zanussi                   5116    0.9%
CK Hu                         5072    0.9%
Ilya Dryomov                  4764    0.8%
Linus Walleij                 4738    0.8%
Maxime Ripard                 4631    0.8%
Mathieu Poirier               4559    0.8%
Christoph Hellwig             4232    0.7%
Finn Thain                    4024    0.7%
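The percentages in the "by changesets" column appear to be each developer's share of the 12,283 changesets merged in the cycle, rounded to one decimal place. A small sketch of that arithmetic, using a hypothetical share_tenths() helper that works in tenths of a percent to stay in integer math:

```c
#include <assert.h>

/*
 * Hypothetical helper: share of count in total, in tenths of a percent,
 * rounded to nearest (so 17 means 1.7%).
 */
unsigned long share_tenths(unsigned long count, unsigned long total)
{
	assert(total > 0);
	return (count * 1000 + total / 2) / total;
}
```

For example, share_tenths(208, 12283) yields 17, matching the 1.7% shown for H Hartley Sweeten, and share_tenths(132, 12283) yields 11, matching the 1.1% shown for Boris Brezillon.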
By this point it should come as no surprise that H Hartley Sweeten made it to the top of the "by changesets" list with continued work on the Comedi drivers in the staging tree; nearly 8,400 patches have gone into that subsystem since it was merged. Boris Brezillon's work was mostly focused on the memory-technology devices subsystem (and NAND controllers in particular), Al Viro made a number of fundamental changes (including parallel lookups) to the virtual filesystem layer and followed the implications of those changes through many filesystems, Linus Walleij has been reworking the GPIO subsystem, and Geert Uytterhoeven worked all over the tree, with an emphasis on various ARM-related subsystems.
In the "lines changed" column, Alex Deucher continues to work on the massive amdgpu graphics driver; Rex Zhu is also working primarily on that driver. Paul McKenney works with the read-copy-update subsystem, of course; the elevated line count this time around results from some large documentation changes. Thierry Reding works with the NVIDIA Tegra ARM subarchitecture, and Jinshan Xiong made some extensive changes to the Lustre filesystem in the staging tree.
Work in the staging tree often overshadows everything else when it comes to these lists, but, this time around, only two developers who appear in the top ten on either side were working on staging code.
There were 222 companies (that we know about) that supported work merged in the 4.7 development cycle — a fairly average figure for recent years. The most active companies this time around were:
Most active 4.7 employers
By changesets
Intel                         1786   14.5%
(None)                         968    7.9%
Red Hat                        967    7.9%
(Unknown)                      861    7.0%
Linaro                         633    5.2%
SUSE                           470    3.8%
IBM                            378    3.1%
AMD                            302    2.5%
Samsung                        276    2.2%
                               244    2.0%
Renesas Electronics            244    2.0%
NVIDIA                         231    1.9%
Mellanox                       227    1.8%
Free Electrons                 222    1.8%
ARM                            217    1.8%
Vision Engraving Systems       208    1.7%
Oracle                         200    1.6%
Imagination Technologies       193    1.6%
Texas Instruments              185    1.5%
Broadcom                       141    1.1%
By lines changed
Intel                        86056   14.8%
AMD                          69065   11.8%
(None)                       35035    6.0%
Red Hat                      33887    5.8%
IBM                          28102    4.8%
Linaro                       23396    4.0%
(Unknown)                    23287    4.0%
NVIDIA                       18023    3.1%
Mellanox                     14011    2.4%
Samsung                      12918    2.2%
SUSE                         12810    2.2%
Free Electrons               12637    2.2%
QLogic                       11731    2.0%
ARM                           9000    1.5%
Rockchip                      8938    1.5%
Renesas Electronics           8734    1.5%
Texas Instruments             7462    1.3%
(Consultant)                  6964    1.2%
Chelsio                       6868    1.2%
Broadcom                      6564    1.1%
This table looks as it has for some time; there are no real surprises here. The percentage of changes from developers working on their own time, at 7.9%, is up from 4.6, but still remains low by historical standards. Once upon a time, volunteer developers were our primary source of new contributors to the kernel. In 4.7, of the 232 first-time contributors, 132 were known to be employed at the time, 38 were known to be working on their own time, and 62 are in the "unknown" column. Even if all the unknowns are volunteers (most of them probably are), we still have more new contributors arriving via companies.
Contributing to the kernel used to be a fairly reliable way to get a job, and it probably still is. But, in 2016, it seems that many of our new developers get the job first, and it is the job that brings them to the kernel community.
The table above shows the changes contributed by the most active companies. One last question one might ask is: how many developers does each company have working on Linux? For the 4.7 development cycle, the answer looks like this:
# of developers/company
Company                      Count  Percent
(Unknown)                      238   14.5%
Intel                          198   12.1%
(None)                         172   10.5%
Red Hat                         91    5.6%
IBM                             64    3.9%
                                48    2.9%
Linaro                          43    2.6%
Mellanox                        38    2.3%
SUSE                            37    2.3%
AMD                             30    1.8%
Samsung                         27    1.6%
Huawei Technologies             27    1.6%
ARM                             25    1.5%
Texas Instruments               23    1.4%
Broadcom                        22    1.3%
Oracle                          21    1.3%
NXP                             20    1.2%
Qualcomm                        17    1.0%
MediaTek                        13    0.8%
Imagination Technologies        12    0.7%
Renesas Electronics             12    0.7%
                                11    0.7%
NVIDIA                          11    0.7%
Code Aurora Forum               10    0.6%
(Consultant)                    10    0.6%
Rockchip                        10    0.6%
Canonical                       10    0.6%
Free Electrons                   9    0.5%
Pengutronix                      9    0.5%
Synopsys                         8    0.5%
Intel, it seems, has far more developers working on the kernel than any other company: just over 12% of the total in 4.7. Volunteer developers may not contribute a lot of code, but there are quite a few of them; given that many (if not most) of the unknown developers probably fall into this category, developers working on their own time are still the biggest group.
The kernel community as a whole is a big group indeed, and it continues to produce kernels in a disciplined and predictable way. The relative lack of surprises may make for relatively boring statistics articles, but it is certainly welcome to users of the kernel.
Page editor: Jonathan Corbet
