Kernel development [LWN.net]

Kernel release status

The 3.19 merge window is still open; see the separate article below for a summary of the (many) changes merged in the last week.

Stable updates: 3.18.1, 3.17.7, 3.14.27, and 3.10.63 were released on December 16.

Quotes of the week

Joy.. 1am middle of a torrential rainstorm trying to rig a tarpaulin over the basement before the drain at the bottom overtops.. ok there are worse jobs than writing floppy drivers

— Alan Cox

/* Cetero censeo, checkpatch.pl esse delendam */

— Al Viro

Giving root the power to shoot himself in the foot is one thing. Giving root a loaded gun pointed at his foot with the hammer pulled back, and a sign that says I dare you to pull the trigger, seems like a bad idea.

— Eric Biederman

I have been the wireless maintainer for a long time, and I personally would like to develop in a different direction. Plus, I think that Linux will benefit from having some fresh blood involved in more of the maintenance duties. I will be stepping aside to let that happen.

— John Linville

Scalability Techniques for Practical Synchronization Primitives (ACM Queue)

Davidlohr Bueso gives an overview of kernel locking scalability techniques in this ACM Queue article. "There have recently been significant efforts to address lock-scaling issues in the Linux kernel on large high-end servers. Many of the problems and solutions apply to similar system software. This article applies general ideas and lessons learned to a wider systems context, in the hope that it can be helpful to people who are encountering similar scaling problems."

nftables 0.4 released

For those of you following the development of nftables (the virtual-machine-based eventual replacement for iptables), version 0.4 of the user-space nftables utility is out. It provides access to a lot of new features, including global ruleset operations, improved logging support, masquerading and NAT, redirect support (which needs a 3.19 kernel), and a lot of fixes.

3.19 Merge window part 2

By Jonathan Corbet
December 17, 2014

Last week's 3.19 merge window summary noted that things had gotten off to a slow start. Linus has made up for lost time since then, though; as of this writing, just over 10,400 changesets have been pulled into the mainline repository — over 8,000 since last week. Needless to say, those changes represent a great deal of fixes and new work. The most significant user-visible changes include:

The networking layer has a new subsystem for offloading switching and routing duties to suitably capable hardware.
The NFS client and server both now support the NFS 4.2 ALLOCATE and DEALLOCATE options. The former can be used to request preallocation of storage for a file, while the latter is useful for punching holes.
The f2fs filesystem has a new "fastboot" option that shorts out a number of boot-time checks.
Filters used with the ftrace subsystem now support the logical NOT ("!") operator in expressions.
Device tree overlay support has been merged. This feature should make life easier for developers working on systems with "shields" or other types of daughterboards that need to be worked into the device tree at system boot time.
There is a new getsockopt() option called SO_INCOMING_CPU. It returns the CPU on which processing for the given socket is happening. When used with multi-queue hardware on large systems, this option can allow an application to divide work across processors, maximizing throughput.
It is now possible to attach enhanced BPF programs to network sockets. For now, this capability can only be used for statistics gathering, but other applications should become possible in future development cycles.
The new "ipvlan" driver enables the creation of virtual network devices for container interconnection. It is designed to work well with network namespaces. Ipvlan is much like the existing macvlan driver, but it does its multiplexing at a higher level in the stack.
The Btrfs filesystem's RAID5 and RAID6 implementation finally has support for disk scrubbing and replacement.
The execveat() system call has been merged. Like the other "at" system calls, it takes a file descriptor for the directory to be used as the starting point for finding the executable file. It can also be used to execute a binary file directly from an open file descriptor, allowing for a better implementation of the fexecve() system call found on other Unix-like systems.
The squashfs filesystem now supports compression with the LZ4 algorithm.
The "AMD KFD" driver has been merged; it provides a new interface to graphical processors for non-graphics (e.g. GPGPU) applications.
Some complaints on the mailing lists notwithstanding, the Android "binder" code has been moved from the staging tree into the kernel proper. In the end, it's an API that has been shipped in millions of systems and has to be supported somehow.
New hardware support includes:
- Audio: Intel Baytrail-based audio devices, Samsung Exynos7 I2S controllers, NXP Semiconductors TFA9879 amplifiers, and Texas Instruments TS3A227E headset chips.
- Graphics: Sharp LQ101R1SX01 panels, Freescale i.MX GPUs (staging graduation), R-Car DU HDMI encoders, Analog Device ADV7511(W) and ADV7513 HDMI encoders, and Rockchip SoC-based GPUs.
- IIO: Silicon Labs Si7013/20/21 humidity/temperature sensors, Bosch Sensortec BMP280 pressure sensors, and Qualcomm SPMI PMIC current analog-to-digital converters.
- Miscellaneous: Dallas/Maxim DS1374 watchdog timers, Freescale Layerscape PCIe controllers, Qualcomm SPMI PMIC pin controllers, Intel Cherryview/Braswell pin controllers, IMG synchronous peripheral flash interfaces, IMG I2C serial control bus controllers, Amlogic Meson I2C controllers, Amlogic Meson SPI flash controllers, ACPI "platform communication channel" devices, IBM OPAL real-time clocks, IBM PowerNV OPAL IPMI interfaces, IPMI controllers connected via SMBUS, TI OMAP internal UARTs, Xilinx Clocking Wizard clock generators (staging), and TI LP8860 4 channel LED controllers.
- Networking: Marvell 88E6352 ethernet switch chips and Rocker network switches.
- USB: Broadcom USB3.0 device controllers, STMicroelectronics MIPHY28LP PHYs, and Marvell Berlin USB PHYs.
- Video4Linux: DVBSky S950 V3 video bridges, Montage M88RS6000 internal tuners, Panasonic MN88472 and MN88473 demodulators, and Amlogic Meson IR remote receivers.
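
The SO_INCOMING_CPU option described in the list above is read with an ordinary getsockopt() call. What follows is a minimal Python sketch rather than anything taken from the kernel patches: incoming_cpu() is a hypothetical helper name, and the fallback value 49 is an assumption (it is the number used in the generic Linux socket headers, but some architectures differ) that applies only when the Python build does not export the constant.

```python
import socket

# Assumption: 49 is SO_INCOMING_CPU in the generic Linux socket headers; it is
# used only if this Python build does not export the constant itself.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def incoming_cpu(sock):
    """Return the CPU number the kernel reports for this socket, or None."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
    except OSError:
        # Kernels without the option reject the request with an error.
        return None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("incoming CPU:", incoming_cpu(s))
```

A server with one worker per processor could, for example, use the returned value to hand each accepted connection to the worker running on that CPU; whether that helps depends on the hardware and the workload.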

Changes visible to kernel developers include:

The x86 memory-management code now makes fuller use of the page attribute table (PAT) modes offered by current processors. In particular, this change enables the use of write-through caching.
There is a new API that allows drivers to obtain device property information from either ACPI or a device tree without having to know which is in use. See this commit for a brief overview of the new calls and this commit for a related interface for use when no device structure is present.
The virtio subsystem has seen a lot of work to make it comply with the virtio 1.0 standard.
The I2C subsystem can now enable a Linux system to act like an I2C slave if the bus controller supports that mode. Documentation is nonexistent, but we are promised that it will show up before the end of the cycle.
The GPIO subsystem can now change the values of multiple GPIO outputs simultaneously — if the hardware supports it, of course. See the documentation changes at the top of this commit for a list of the API additions to support this functionality.
The owner field has been dropped from struct platform_driver, leading to extensive tree-wide changes to remove all uses of that field.
Support for the ARM "Coresight" tracing mechanism has been added to the kernel. See Documentation/trace/coresight.txt for information about this subsystem and how to work with it.
Atomic modesetting support has been added to the direct rendering layer; this feature allows multiple graphical mode parameters to be set in a single, atomic operation. See this merge commit for an overview of what's provided. One important thing that is still missing is the actual ioctl() to provide the feature to user space; that will likely come in 3.20 along with more driver support.

As always, the kernel has a wide variety of contributors. While it is often hard to tell a contributor's age from their posted patches, your editor is confident that this patch is the first from a four-year-old to ever make it into the kernel.

At this point, most of the major trees (from contributors of all ages) have been pulled, so the rate of change in the mainline repository can be expected to slow. That said, the merge window will probably remain open until December 21. Next week's summary will cover the final patches that are pulled for the 3.19 development cycle.

On the problem of maintainer abuse

By Jonathan Corbet
December 17, 2014

As can be seen in the LWN kernel patch tracker (or in the patches section of the weekly Kernel Page), there have been a lot of significant patch sets posted over the course of the last week or so. The pace of kernel development continues to increase, so there is always a lot of new code out there in need of review. There's just one little potential problem: as of this writing, the 3.19 merge window is open, and many subsystem maintainers are busy getting the current set of changes into the mainline and dealing with any resulting fallout. They are unlikely to have much spare time for patch review.

Merge-window patch postings are not uncommon; most maintainers either just defer looking at them or ignore them altogether. This time around, though, Thomas Gleixner vented his frustration on the linux-kernel list:

Nothing of this is 3.19 material so posting it right now is just useless. I'm not going to look at it and I'm not going to look at it next week either. This whole featuritis driven 'post crap as fast as you can' thing has to stop, really.

Posting patches during the merge window, he said, constitutes "maintainer abuse."

Some developers agreed with these sentiments, and Kevin Cernekee promptly posted (during the merge window, of course) a patch series titled "Stop maintainer abuse" trying to codify a rule that patches should not be posted during merge windows. Patches meant for the next merge window should, by these rules, be posted prior to the preceding -rc5 release; patches posted after -rc5 comes out will end up being merged one release later. And no patches at all, other than urgent fixes, should be posted while the merge window is open.

There is a potential problem, though, in that not all subsystem maintainers work the same way. Christoph Hellwig disagreed with the rules, saying:

Merge window isn't really special, and patches can easily be reviewed and queued up for the next merge window in that time. If it said you shouldn't expect replies and not _resend_ during the merge window that seems like a much saner policy.

Linus responded to the posting guidelines by noting that they don't apply equally to all maintainers:

[F]or fairly simple subsystems in particular, some maintainers basically have their pull requests for the merge window open *before* the merge window even starts, and for them, the merge window itself isn't actually all that busy, it's often the week before that is the busy one. So the exact timing can vary by maintainership, and while I think the above is a reasonable example, it should perhaps be documented as such.

Alan Cox worried that trying to put a lid on patch postings is always doomed to failure:

Every time anyone has tried to deal with Linux scaling problems by throttling the rate it has failed, from the near forking of it when Linus couldn't cope onwards. Today we are already seeing the same occurring with all the vendor trees, and shared downstream trees with a rapidly growing amount of stuff that simply isn't upstream because upstream can't keep up with actual product timescales any more.

His suggestion was to, instead, try to streamline the process a bit, mostly by improving the patchwork system to automate (or at least assist) many common maintainer duties. He finished by proposing that: "It could then be integrated into git (if only so we can have a 'git lost' command to block annoying sources)."

In the end, it is hard to see this problem being solved by either more rules or better tools. The creation of kernel patches continues at an increasing pace; the kernel community has to keep up with the flow somehow or suffer in the long run. In many cases, what may really be needed is more maintainers; some subsystems are now maintained by groups, but most of them are still managed by a single developer. Spreading the load would allow some maintainers to work on merge window issues while others keep track of the patch flow.

Such a change would require maintainers to allow others into their often fiercely guarded domains, though; the groups would also have to put time into developing a workflow that would work for them. It is not a simple or immediate solution, and it still will not address ills like developers who repost lengthy patch sets multiple times in one day. So, it seems, maintainers will still just have to get grumpy occasionally when developers push the boundaries too hard.

User namespaces and setgroups()

By Jonathan Corbet
December 17, 2014

Back in November, we looked at a patch that would allow unprivileged processes to drop groups from their credentials. After that patch was posted, it was quickly shown that, in some cases, dropping groups leads to an increase in privilege; the patch in question has not been pursued since. But it was also shown that an unprivileged user can already drop groups by making use of user namespaces. It took some time, but namespace developer Eric Biederman has put together a set of patches that, he hopes, will close that vulnerability.

Group membership can be used to restrict privilege in a couple of ways. Access control lists can explicitly block access to a resource on the basis of membership in a particular group. But it is even simpler than that: if a file's protection bits are set for "no group access," a process belonging to that group will be blocked, even if the file is otherwise accessible by the world as a whole. In either case, the ability to drop a group can enable a process to access a resource that would have otherwise been denied to it.
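
The "no group access" case is easier to see with a small model of the classic owner/group/other check. The Python sketch below is a deliberate simplification (it ignores root, capabilities, and access control lists), and the function name and the user and group IDs are made up for illustration.

```python
import stat


def may_read(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Simplified model of the owner/group/other read-permission check.

    Only the first matching class is consulted, so a group match with no
    group permission denies access even if "other" would have allowed it.
    """
    if proc_uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in proc_gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# Mode 0604: owner read/write, no group access, world-readable.
MODE = 0o604
print(may_read(MODE, 0, 50, 1000, {1000, 50}))  # False: member of group 50
print(may_read(MODE, 0, 50, 1000, {1000}))      # True: group 50 dropped
```

The same process is refused while it belongs to group 50 and admitted once that membership is gone, which is why dropping a group can amount to a gain in privilege.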

In current kernels, using setgroups() to change a process's group membership is a privileged operation. So unprivileged processes cannot use it to get rid of any inconvenient group memberships. But a process running within a user namespace is privileged inside that namespace, so a setgroups() call there will succeed. It is easy to write a little program that uses clone() to create a child in a user namespace and has the child call setgroups() to drop membership in all supplementary groups. This privilege-escalation vulnerability has become known as CVE-2014-8989.

Eric's fix for this problem starts by disabling the use of setgroups() within a user namespace until a group-ID mapping has been set up for that namespace. That mapping is created by writing the file gid_map in the process's /proc directory; see this article for details on how the mapping files work. Other user- or group-ID-oriented system calls require the existence of a mapping before they will succeed; setgroups() now has that restriction as well.

The biggest part of the patch adds a new control file, called setgroups, to the /proc directory for each process. Writing the string "deny" to that file will disable the setgroups() system call entirely within the namespace containing the relevant process. The CAP_SYS_ADMIN capability is required, so random processes cannot disable setgroups() in the top-level namespace; once again, a process within its own user namespace is privileged (by default) and can make this change successfully. Once setgroups() has been turned off, it cannot be enabled again in that namespace or any of its descendants. The setgroups file can only be written to before the group-ID mapping has been set.

Finally, an unprivileged process can only change the group-ID mapping of a namespace if setgroups() has been disabled. The only thing an unprivileged process can do with the group-ID mapping is to map its own primary group ID to the same ID in the parent namespace; an unprivileged process is not able to remap its supplementary groups. So, with this set of restrictions in place, it essentially becomes impossible to (1) play tricks with mappings to drop groups, or (2) call setgroups() at all without privilege.
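
Taken together, these rules imply an order of operations for an unprivileged process setting up its namespace: write "deny" to the setgroups file first, then write the group-ID mapping. The Python sketch below illustrates that order under some assumptions: the helper name and the proc_root parameter are hypothetical (the latter exists only so the sketch can be exercised against a scratch directory), and the process and group IDs are placeholders.

```python
import os
import tempfile


def deny_setgroups_then_map_gid(pid, gid, proc_root="/proc"):
    """Disable setgroups() for pid's user namespace, then map one group ID."""
    base = os.path.join(proc_root, str(pid))
    # Step 1: turn setgroups() off; this must precede the gid_map write.
    with open(os.path.join(base, "setgroups"), "w") as f:
        f.write("deny")
    # Step 2: map the group onto itself.
    # Line format: <ID inside namespace> <ID in parent namespace> <count>
    with open(os.path.join(base, "gid_map"), "w") as f:
        f.write(f"{gid} {gid} 1\n")


if __name__ == "__main__":
    # Dry run against a scratch directory instead of the real /proc.
    with tempfile.TemporaryDirectory() as scratch:
        os.makedirs(os.path.join(scratch, "1234"))
        deny_setgroups_then_map_gid(1234, 1000, proc_root=scratch)
        for name in ("setgroups", "gid_map"):
            with open(os.path.join(scratch, "1234", name)) as f:
                print(name, "->", f.read().strip())
```

Against the real /proc, the two writes would target a process inside a newly created user namespace on a kernel carrying these patches; reversing the order would be expected to fail for an unprivileged writer.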

Note that if a privileged process creates a user namespace, it can set up arbitrary mappings for group IDs and decline to disable setgroups(). That would make the dropping of groups within the namespace possible, but, since the process is already privileged, it could do that anyway.
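
The interaction between the setgroups file, the group-ID mapping, and privilege can be summarized as a small state machine. The Python class below is a toy model of the rules as described in this article, not a rendering of the kernel code; the class and method names are invented for illustration.

```python
class UserNamespace:
    """Toy model of the setgroups()/gid_map rules for one user namespace."""

    def __init__(self, parent=None):
        # A denial in an ancestor is inherited and cannot be undone.
        self.setgroups_allowed = parent.setgroups_allowed if parent else True
        self.gid_map_set = False

    def deny_setgroups(self):
        # The control file can only be written before gid_map is set.
        if self.gid_map_set:
            raise PermissionError("gid_map has already been set")
        self.setgroups_allowed = False

    def set_gid_map(self, privileged=False):
        # An unprivileged writer must have disabled setgroups() first.
        if not privileged and self.setgroups_allowed:
            raise PermissionError("setgroups() must be denied first")
        self.gid_map_set = True

    def setgroups(self):
        # Needs an existing mapping and must not have been denied.
        if not self.gid_map_set or not self.setgroups_allowed:
            raise PermissionError("setgroups() is not permitted here")
        return True


if __name__ == "__main__":
    ns = UserNamespace()
    ns.deny_setgroups()
    ns.set_gid_map()  # allowed without privilege once setgroups is denied
    try:
        ns.setgroups()
    except PermissionError as err:
        print("unprivileged namespace:", err)

    priv = UserNamespace()
    priv.set_gid_map(privileged=True)
    print("privileged creator:", priv.setgroups())
```

In this model there is no path by which an unprivileged process ends up with both a group-ID mapping and a working setgroups(), which is the property the patch set is meant to guarantee.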

The end result of all this work should be the closing of the vulnerability caused by being able to drop groups within a user namespace. But it highlights one of the hazards that come with the user namespace territory: while it seems possible to contain privilege within a user namespace, there is always the possibility of surprises like this one hiding in the corners of the system. It may be some time yet before we can be truly confident that all of those surprises have been found and that the unprivileged creation of user namespaces is truly a safe thing to allow.

Eric has asked Linus to pull these changes for the 3.19 development cycle; that pull happened just as this week's Edition was going to press. The patches have been marked for stable backporting as well, so they should eventually become available in the stable update series.

Patches and updates

Greg KH: Linux 3.18.1
Greg KH: Linux 3.17.7
Greg KH: Linux 3.14.27
Jiri Slaby: Linux 3.12.35
Greg KH: Linux 3.10.63
Ben Hutchings: Linux 3.2.65
Wang Nan: ARM: kprobes: OPTPROBES and other improvements.
Vincent Yang: Support for Fujitsu MB86S7X SoCs
Feng Wu: Add VT-d Posted-Interrupts support
Kevin Cernekee: Generic BMIPS kernel
Christian Borntraeger: ACCESS_ONCE and non-scalar accesses
Alexander Duyck: [net-next PATCH v7 resubmit 0/4] arch: Add lightweight memory barriers for coherent memory access
Con Kolivas: BFS CPU scheduler v0.460 for linux-3.18
Thomas Graf: rhashtable: Per bucket locks & deferred table resizing
Rakib Mullick: BLD-3.18 release.
Martin KaFai Lau: tcp: TCP tracer
Sergei Shtylyov: extcon: add MAX3355 driver
Pankaj Dubey: Introducing Exynos ChipId driver
Krzysztof Kozlowski: devfreq: exynos: Add driver for Exynos3250
Jaewon Kim: Add regulator-haptic driver
Jarkko Sakkinen: TPM 2.0 support
Stanimir Varbanov: Qualcomm PCIe and PCIe/PHY drivers
Benoit Parrot: gpio: add GPIO hogging mechanism
Ding Tianhong: add hisilicon hip04 ethernet driver
Mandeep Sandhu: uio hotplug support
Sneeker Yeh: Add support for Fujitsu USB host controller
Tomeu Vizoso: Add support for Tegra Activity Monitor
NeilBrown: Add support for 'tty-slaves' described by devicetree.
Chanwoo Choi: [PATCHv4 0/8] devfreq: Add devfreq-event class to provide raw data for devfreq device
atull@opensource.altera.com: FPGA Manager Framework
Geert Uytterhoeven: Add Simple Power-Managed Bus Support
Mitchel Humpherys: iopoll: Introduce memory-mapped IO polling macros
Kevin Cernekee: Stop maintainer abuse
Eric W. Biederman: [CFT][PATCH v6] userns: Add a knob to disable setgroups on a per user namespace basis
Luis R. Rodriguez: x86: add xen hypercall preemption
Josh Poimboeuf: Kernel Live Patching
Adrian Hunter: perf tools: Introduce an abstraction for Instruction Tracing
