Kernel development
Brief items
Kernel release status
The 3.19 merge window is still open; see the separate article below for a summary of the (many) changes merged in the last week.
Stable updates: 3.18.1, 3.17.7, 3.14.27, and 3.10.63 were released on December 16.
Quotes of the week
Scalability Techniques for Practical Synchronization Primitives (ACM Queue)
Davidlohr Bueso gives an overview of kernel locking scalability techniques in this ACM Queue article. "There have recently been significant efforts to address lock-scaling issues in the Linux kernel on large high-end servers. Many of the problems and solutions apply to similar system software. This article applies general ideas and lessons learned to a wider systems context, in the hope that it can be helpful to people who are encountering similar scaling problems."
nftables 0.4 released
For those of you following the development of nftables (the virtual-machine-based eventual replacement for iptables), version 0.4 of the user-space nftables utility is out. It provides access to a lot of new features, including global ruleset operations, improved logging support, masquerading and NAT, redirect support (which will need a 3.19 kernel), and a lot of fixes.
Kernel development news
3.19 Merge window part 2
Last week's 3.19 merge window summary noted that things had gotten off to a slow start. Linus has made up for lost time since then, though; as of this writing, just over 10,400 changesets have been pulled into the mainline repository — over 8,000 since last week. Needless to say, those changes represent a great deal of fixes and new work. The most significant user-visible changes include:
- The networking layer has a new subsystem for offloading switching
and routing duties to suitably capable hardware.
- The NFS client and server both now support the NFS 4.2 ALLOCATE and
DEALLOCATE
options. The former can be used to request preallocation of storage
for a file, while the latter is useful for punching holes.
- The f2fs filesystem has a new "fastboot" option that shorts
out a number of boot-time checks.
- Filters used with the ftrace subsystem now support the logical NOT
("!") operator in expressions.
- Device tree overlay support has been
merged. This feature should make life easier for developers working
on systems with "shields" or other types of daughterboards that need
to be worked into the device tree at system boot time.
- There is a new getsockopt() option called
SO_INCOMING_CPU. It returns the CPU on which processing for
the given socket is happening. When used with multi-queue hardware on
large systems, this option can allow an application to divide work
across processors, maximizing throughput.
- It is now possible to attach enhanced BPF
programs to network sockets. For now, this capability can only be
used for statistics gathering, but other applications should become
possible in future development cycles.
- The new "ipvlan" driver enables the creation of virtual network devices
for container interconnection. It is designed to work well with
network namespaces. Ipvlan is much like the existing macvlan driver,
but it does its multiplexing at a higher level in the stack.
- The Btrfs filesystem's RAID5 and RAID6 implementation finally has
support for disk scrubbing and replacement.
- The execveat() system call has been merged. Like the other
"at" system calls, it takes a file descriptor for the directory to be
used as the starting point for finding the executable file. It can
also be used to execute a binary file directly from an open file
descriptor, allowing for a better implementation of the
fexecve() system call found on other Unix-like systems.
- The squashfs filesystem now supports compression with the LZ4
algorithm.
- The "AMD KFD" driver has been merged; it provides a new interface to
graphical processors for non-graphics (e.g. GPGPU) applications.
- Some complaints on the mailing lists notwithstanding, the Android
"binder" code has been moved from the staging tree into the kernel
proper. In the end, it's an API that has been shipped in millions of
systems and has to be supported somehow.
- New hardware support includes:
- Audio:
Intel Baytrail-based audio devices,
Samsung Exynos7 I2S controllers,
NXP Semiconductors TFA9879 amplifiers, and
Texas Instruments TS3A227E headset chips.
- Graphics:
Sharp LQ101R1SX01 panels,
Freescale i.MX GPUs (staging graduation),
R-Car DU HDMI encoders,
Analog Devices ADV7511(W) and ADV7513 HDMI encoders, and
Rockchip SoC-based GPUs.
- IIO:
Silicon Labs Si7013/20/21 humidity/temperature sensors,
Bosch Sensortec BMP280 pressure sensors, and
Qualcomm SPMI PMIC current analog-to-digital converters.
- Miscellaneous:
Dallas/Maxim DS1374 watchdog timers,
Freescale Layerscape PCIe controllers,
Qualcomm SPMI PMIC pin controllers,
Intel Cherryview/Braswell pin controllers,
IMG synchronous peripheral flash interfaces,
IMG I2C serial control bus controllers,
Amlogic Meson I2C controllers,
Amlogic Meson SPI flash controllers,
ACPI "platform communication channel" devices,
IBM OPAL real-time clocks,
IBM PowerNV OPAL IPMI interfaces,
IPMI controllers connected via SMBUS,
TI OMAP internal UARTs,
Xilinx Clocking Wizard clock generators (staging), and
TI LP8860 4 channel LED controllers.
- Networking:
Marvell 88E6352 ethernet switch chips and
Rocker network switches.
- USB:
Broadcom USB3.0 device controllers,
STMicroelectronics MIPHY28LP PHYs, and
Marvell Berlin USB PHYs.
- Video4Linux: DVBSky S950 V3 video bridges, Montage M88RS6000 internal tuners, Panasonic MN88472 and MN88473 demodulators, and Amlogic Meson IR remote receivers.
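As a rough illustration of the SO_INCOMING_CPU item in the list above, the following Python sketch queries the option on a socket. It is not taken from the patch itself; the incoming_cpu() helper is a hypothetical name, and the fallback value of 49 is an assumption that holds for x86, ARM, and other architectures using the generic socket header, but not for every architecture. On kernels older than 3.19, or on other operating systems, the call is expected to fail, in which case the helper returns None.

```python
import socket

# Assumption: Python exposes socket.SO_INCOMING_CPU only in recent releases,
# so fall back to the asm-generic value (49). Some architectures use a
# different number; check the kernel headers for the target system.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def incoming_cpu(sock):
    """Return the CPU the kernel reports for this socket, or None.

    None means the option is not supported here (for example, a kernel
    older than 3.19 or a non-Linux system).
    """
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
    except OSError:
        return None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("incoming CPU:", incoming_cpu(s))
```

A server with one worker per CPU could, in principle, call a helper like this on each accepted connection and hand the socket to the worker pinned to the reported processor; whether that actually helps depends on the hardware queue configuration.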
Changes visible to kernel developers include:
- The x86 memory-management code now makes fuller use of the page
attribute table (PAT) modes offered by current processors. In
particular, this change enables the use of write-through caching.
- There is a new API that allows drivers to obtain device property
information from either ACPI or a device tree without having to know
which is in use. See this
commit for a brief overview of the new calls and this
commit for a related interface for use when no device
structure is present.
- The virtio subsystem has seen a lot of work to make it comply with the
virtio 1.0 standard.
- The I2C subsystem can now enable a Linux system to act like an I2C
slave if the bus controller supports that mode. Documentation is
nonexistent, but we are promised that it will show up before the end
of the cycle.
- The GPIO subsystem can now change the values of multiple GPIO outputs
simultaneously — if the hardware supports it, of course. See the
documentation changes at the top of this
commit for a list of the API additions to support this
functionality.
- The owner field has been dropped from struct
platform_driver, leading to extensive tree-wide changes to remove
all uses of that field.
- Support for the ARM "Coresight" tracing mechanism has been added to
the kernel. See Documentation/trace/coresight.txt for
information about this subsystem and how to work with it.
- Atomic modesetting support has been added to the direct rendering layer; this feature allows multiple graphical mode parameters to be set in a single, atomic operation. See this merge commit for an overview of what's provided. One important thing that is still missing is the actual ioctl() to provide the feature to user space; that will likely come in 3.20 along with more driver support.
As always, the kernel has a wide variety of contributors. While it is often hard to tell a contributor's age from their posted patches, your editor is confident that this patch is the first from a four-year-old to ever make it into the kernel.
At this point, most of the major trees (from contributors of all ages) have been pulled, so the rate of change in the mainline repository can be expected to slow. That said, the merge window will probably remain open until December 21. Next week's summary will cover the final patches that are pulled for the 3.19 development cycle.
On the problem of maintainer abuse
As can be seen in the LWN kernel patch tracker (or in the patches section of the weekly Kernel Page), there have been a lot of significant patch sets posted over the course of the last week or so. The pace of kernel development continues to increase, so there is always a lot of new code out there in need of review. There's just one little potential problem: as of this writing, the 3.19 merge window is open, and many subsystem maintainers are busy getting the current set of changes into the mainline and dealing with any resulting fallout. They are unlikely to have much spare time for patch review.
Merge-window patch postings are not uncommon; most maintainers either just defer looking at them or ignore them altogether. This time around, though, Thomas Gleixner vented his frustration on the linux-kernel list:
Posting patches during the merge window, he said, constitutes "maintainer abuse."
Some developers agreed with these sentiments, and Kevin Cernekee promptly posted (during the merge window, of course) a patch series titled "Stop maintainer abuse" trying to codify a rule that patches should not be posted during merge windows. Patches meant for the next merge window should, by these rules, be posted prior to the preceding -rc5 release; patches posted after -rc5 comes out will end up coming out one release later. And no patches at all, other than urgent fixes, should be posted while the merge window is open.
There is a potential problem, though, in that not all subsystem maintainers work the same way. Christoph Hellwig disagreed with the rules, saying:
Linus responded to the posting guidelines by noting that they don't apply equally to all maintainers:
Alan Cox worried that trying to put a lid on patch postings is always doomed to failure:
His suggestion was to, instead, try to streamline the process a bit, mostly by improving the patchwork system to automate (or at least assist) many common maintainer duties. He finished by proposing that: "It could then be integrated into git (if only so we can have a 'git lost' command to block annoying sources)."
In the end, it is hard to see this problem being solved by either more rules or better tools. The creation of kernel patches continues at an increasing pace; the kernel community has to keep up with the flow somehow or suffer in the long run. In many cases, what may really be needed is more maintainers; some subsystems are now maintained by groups, but most of them are still managed by a single developer. Spreading the load would allow some maintainers to work on merge window issues while others keep track of the patch flow.
Such a change would require maintainers to allow others into their often fiercely guarded domains, though; the groups would also have to put time into developing a workflow that would work for them. It is not a simple or immediate solution, and it still will not address ills like developers who repost lengthy patch sets multiple times in one day. So, it seems, maintainers will still just have to get grumpy occasionally when developers push the boundaries too hard.
User namespaces and setgroups()
Back in November, we looked at a patch that would allow unprivileged processes to drop groups from their credentials. After that patch was posted, it was quickly shown that, in some cases, dropping groups leads to an increase in privilege; the patch in question has not been pursued since. But it was also shown that an unprivileged user can already drop groups by making use of user namespaces. It took some time, but namespace developer Eric Biederman has put together a set of patches that, he hopes, will close that vulnerability.
Group membership can be used to restrict privilege in a couple of ways. Access control lists can explicitly block access to a resource on the basis of membership in a particular group. But it is even simpler than that: if a file's protection bits are set for "no group access," a process belonging to that group will be blocked, even if the file is otherwise accessible by the world as a whole. In either case, the ability to drop a group can enable a process to access a resource that would have otherwise been denied to it.
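The "no group access" case can be made concrete with a small Python model of the traditional permission check, in which exactly one of the owner, group, or other classes applies. This is a simplified sketch, not kernel code: the FileInfo type, the may_read() function, and the user and group IDs are all hypothetical, and ACLs and capabilities are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    owner_uid: int
    group_gid: int
    mode: int  # traditional permission bits, e.g. 0o604


def may_read(uid, gids, info):
    """Simplified read check: only the first matching class is consulted."""
    if uid == info.owner_uid:
        return bool(info.mode & 0o400)   # owner class
    if info.group_gid in gids:
        return bool(info.mode & 0o040)   # group class; "other" is never tried
    return bool(info.mode & 0o004)       # other class


# Hypothetical file: readable by owner and world, but not by group 50.
restricted = FileInfo(owner_uid=0, group_gid=50, mode=0o604)

if __name__ == "__main__":
    # A member of group 50 is blocked ...
    print(may_read(1000, {1000, 50}, restricted))
    # ... but the same user without that group falls through to "other".
    print(may_read(1000, {1000}, restricted))
```

In this model, shedding membership in group 50 moves the process from the group class to the other class, which is exactly how dropping a group can increase privilege.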
In current kernels, using setgroups() to change a process's group membership is a privileged operation. So unprivileged processes cannot use it to get rid of any inconvenient group memberships. But a process running within a user namespace is privileged inside that namespace, so a setgroups() call there will succeed. It is easy to write a little program that uses clone() to create a child in a user namespace and has the child call setgroups() to drop membership in all supplementary groups. This privilege-escalation vulnerability has become known as CVE-2014-8989.
Eric's fix for this problem starts by disabling the use of setgroups() within a user namespace until a group-ID mapping has been set up for that namespace. That mapping is created by writing the file gid_map in the process's /proc directory; see this article for details on how the mapping files work. Other user- or group-ID-oriented system calls require the existence of a mapping before they will succeed; setgroups() now has that restriction as well.
The biggest part of the patch adds a new control file, called setgroups, to the /proc directory for each process. Writing the string "deny" to that file will disable the setgroups() system call entirely within the namespace containing the relevant process. The CAP_SYS_ADMIN capability is required, so random processes cannot disable setgroups() in the top-level namespace; once again, a process within its own user namespace is privileged (by default) and can make this change successfully. Once setgroups() has been turned off, it cannot be enabled again in that namespace or any of its descendants. The setgroups file can only be written to before the group-ID mapping has been set.
Finally, an unprivileged process can only change the group-ID mapping of a namespace if setgroups() has been disabled. The only thing an unprivileged process can do with the group-ID mapping is to map its own primary group ID to the same ID in the parent namespace; an unprivileged process is not able to remap its supplementary groups. So, with this set of restrictions in place, it essentially becomes impossible to (1) play tricks with mappings to drop groups, or (2) call setgroups() at all without privilege.
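The interaction between these rules can be summarized with a toy state machine. The Python sketch below is only a model of the restrictions as described above, not the kernel implementation; the UserNamespaceModel class, its method names, and the is_denied() helper are all hypothetical, and the model covers only the "deny" string and single-entry mappings.

```python
class UserNamespaceModel:
    """Toy model of one user namespace's setgroups()/gid_map rules."""

    def __init__(self, setgroups_allowed=True):
        self.setgroups_allowed = setgroups_allowed
        self.gid_map_written = False

    def child(self):
        # A denial is inherited by every descendant namespace.
        return UserNamespaceModel(self.setgroups_allowed)

    def write_setgroups(self, value):
        # Models writing to the per-process "setgroups" control file.
        if self.gid_map_written:
            raise PermissionError("setgroups file is locked once gid_map is set")
        if value != "deny":
            raise ValueError("this model only covers the 'deny' string")
        self.setgroups_allowed = False

    def write_gid_map(self, mapped_gid, writer_gid, writer_privileged):
        # Models writing a one-entry mapping to gid_map.
        if not writer_privileged:
            if self.setgroups_allowed:
                raise PermissionError("setgroups() must be disabled first")
            if mapped_gid != writer_gid:
                raise PermissionError("only the writer's primary GID may be mapped")
        self.gid_map_written = True

    def setgroups(self, groups):
        # Models a setgroups() call made inside the namespace.
        if not self.gid_map_written:
            raise PermissionError("no group-ID mapping has been set up")
        if not self.setgroups_allowed:
            raise PermissionError("setgroups() is disabled in this namespace")
        return list(groups)


def is_denied(operation, *args):
    """Return True if the modeled operation raises PermissionError."""
    try:
        operation(*args)
    except PermissionError:
        return True
    return False


if __name__ == "__main__":
    ns = UserNamespaceModel()
    print("setgroups before mapping denied:", is_denied(ns.setgroups, []))
    ns.write_setgroups("deny")
    ns.write_gid_map(1000, 1000, writer_privileged=False)
    print("setgroups after deny denied:", is_denied(ns.setgroups, []))
```

In the model, the only path on which setgroups() succeeds is the one where a privileged writer sets the mapping without first writing "deny", matching the observation that dropping groups remains possible only for processes that already had the privilege to do so.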
Note that if a privileged process creates a user namespace, it can set up arbitrary mappings for group IDs and decline to disable setgroups(). That would make the dropping of groups within the namespace possible, but, since the process is already privileged, it could do that anyway.
The end result of all this work should be the closing of the vulnerability caused by being able to drop groups within a user namespace. But it highlights one of the hazards that come with the user namespace territory: while it seems possible to contain privilege within a user namespace, there is always the possibility of surprises like this one hiding in the corners of the system. It may be some time yet before we can be truly confident that all of those surprises have been found and that the unprivileged creation of user namespaces is truly a safe thing to allow.
Eric has asked Linus to pull these changes for the 3.19 development cycle; that pull happened just as this week's Edition was going to press. The patches have been marked for stable backporting as well, so they should eventually become available in the stable update series.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
