Kernel development
Brief items
Kernel release status
The 4.6 merge window remains open; see the article below for a summary of the (considerable) work that has been merged in the last week.
Stable updates: none have been released in the last week, and none are in the review process as of this writing.
Kernel development news
4.6 Merge window part 2
As of this writing, Linus has pulled 11,118 non-merge changesets into the mainline repository for the 4.6 development cycle; just over 10,000 of those came since last week's summary. As can be seen, the flow of patches can no longer be described as "slow." A number of significant features can be found in that flood of patches.
The most notable user-visible changes include:
- Support for memory protection keys
has been merged. This is an Intel feature allowing user space to
partition its memory into zones and apply additional access
restrictions to each. The system calls for the manipulation of memory
protection keys have not yet been merged; they are waiting for a bit more
review. But keys will be used by the kernel in 4.6 to implement truly
execute-only
memory that cannot be read by the executing process.
- Control groups are now namespace-aware; there is a new
CLONE_NEWCGROUP flag to clone() to create a process
in a new control-group namespace. See this
patch for documentation on this new feature. The control-group
filesystem can also now be mounted within user namespaces.
- The preadv2() and
pwritev2() system calls, which take an extra "flags"
argument, have finally been merged, allowing for the addition of new
functionality. The first flag is RWF_HIPRI, which enables
the use of polling for a high-priority request.
- Page poisoning has traditionally been a kernel debugging feature; it
fills freed pages with a special pattern that is easy to spot when
looking for things that went wrong. In 4.6, poisoning can be enabled
independently of the debugging options, and the "poison" value can be
set to zero; this results in pages being simply cleared when they are
freed. This behavior, inspired by the grsecurity/PaX patches, reduces
the chances of the kernel leaking sensitive data.
- The memory-management subsystem's thrash-detection code has never
worked properly within control groups; that has been rectified. The
result should be better behavior when specific control groups are
experiencing memory pressure.
- The integrity measurement architecture (IMA) subsystem now requires
that its policy be signed, and the integrity of that policy is
measured prior to loading.
- The ARM64 architecture now supports the "user access override" feature
found in ARMv8.2. It allows user space to be accessed (by the kernel)
using ordinary unprivileged instructions that check the owning
process's
permissions in the normal way. That, in turn, offers extra protection
against the kernel being fooled into accessing memory it shouldn't.
- ARM64 also now supports kernel
address-space layout randomization.
- The kernel's representation of general-purpose I/O (GPIO) devices has
been massively reworked; the gpio_chip structure is a proper
device within the device model now. There is a new ABI for getting
information about the GPIOs on the system, but some work remains to be
done. As Linus Walleij noted: "We can now discover GPIOs properly from userspace. We still have not come up with a way to actually *use* GPIOs from userspace." See tools/gpio/lsgpio.c for an example of the new ABI; note that the old sysfs-based ABI is now considered obsolete (even though it has not yet been completely replaced).
- A process's timer slack value (the amount by which timer requests may
be delayed to cause them to coincide with others) can now be seen and
modified via /proc/PID/timerslack_ns.
- The extended BPF virtual machine now implements per-CPU maps for
high-speed statistics collection. There is also a new map type to
store stack traces.
- There is a new network-control API called "devlink," intended for the
setting of various parameters that are not related to any specific
device class. This protocol is, naturally, undocumented; some
information can be found in this
merge changelog.
- The kernel connection multiplexer,
which allows for certain types of higher-level protocol handling in
the kernel, has been merged.
- A number of network-oriented sysctl knobs (tcp_syn_retries,
tcp_synack_retries,
tcp_syncookies,
tcp_reordering,
tcp_retries1,
tcp_retries2,
tcp_orphan_retries,
tcp_fin_timeout,
tcp_notsent_lowat,
igmp_max_memberships,
igmp_max_msf,
igmp_llm_reports, and
igmp_qrv) have been made network-namespace aware, so
that different namespaces can have different values.
- The "local checksum offload" mechanism (described in this article) has been merged. Local
checksum offload speeds checksum calculations, making tunneled
protocol implementations faster. See
Documentation/networking/checksum-offloads.txt for
more information.
- Netlink support over shared memory segments has been removed; it has
never worked correctly and there does not appear to be any user-space code
using it.
- A couple of new filesystem ioctl() commands
(Q_GETNEXTQUOTA and Q_XGETNEXTQUOTA) have been added
to enable efficient iteration through all of the disk quotas on a
filesystem.
- The Btrfs filesystem has a new mount option, nologreplay,
which prevents the replaying of the log tree; this can be used with
ro to obtain a truly read-only mount. The new mount option
usebackuproot is meant to replace the existing
recovery option.
- New hardware support includes:
- Audio:
Maxim MAX9867 and max98926 codecs,
Realtek RT5514 codecs,
AMD audio coprocessors, and
Allwinner A10 S/PDIF controllers.
- GPIO:
WinSystems WS16C48 GPIO controllers,
ACCES 104-DIO-48E GPIO controllers,
Technologic TS-4800 FPGA GPIO controllers,
TI TPIC2810 8-Bit I2C GPO expanders,
TI TPS65218 GPIO controllers,
TI TPS65086 GPO controllers, and
MEN 16Z127 GPIO controllers.
- Input:
BYD BTP10463 touchpads,
MELFAS MIP4 touchscreens,
Freescale i.MX25 integrated touchscreens, and
numerous devices using the Synaptics "Register Mapped Interface"
protocol.
- Media:
TI "camera adaptation layer" capture engines.
- Miscellaneous:
Xilinx NWL PCIe controllers,
Cavium ThunderX PEM PCIe host controllers,
Microchip PIC32 random number generators,
ST Microelectronics adjunct processors,
Qualcomm HIDMA DMA engines,
Active-semi ACT8945A charger controllers,
NXP LPC18XX EEPROM memory,
BCM2835 auxiliary mini UARTs,
Marvell EBU serial ports,
Moxa SmartIO MUE multiport serial cards,
AT91 SAMA5D2 analog to digital converters (ADCs),
Texas Instruments ADC0831/ADC0832/ADC0834/ADC0838 ADCs,
Texas Instruments ADS1015 ADCs,
Freescale MX25 ADCs,
Analog Devices AD5761/61R/21/21R digital to analog converters,
Freescale MPL115A1 pressure sensors,
Atlas Scientific pH-SM sensors,
TI AFE4404 heart rate and pulse oximeter sensors,
TI AFE4403 heart rate monitors,
TI TPS65086 power management integrated chips,
APM SoC X-Gene SLIMpro mailbox controllers,
Rockchip SoC integrated mailboxes,
Hisilicon Hi6220 mailboxes,
ARM high-definition color LCD controllers,
Microchip PIC32MZDA SDHCI controllers, and
MediaTek M4U I/O memory-management units.
- Network:
MediaTek MT7623 Gigabit Ethernet controllers and
Intel Ethernet X722 iWARP cards.
- USB:
Rockchip EMMC PHYs and
Rockchip DisplayPort PHYs.
- Watchdog: Intel MEI iAMT watchdogs, National Instruments 903x/913x watchdog timers, WinSystems EBC-C384 watchdog timers, and ARM SBSA generic watchdogs.
Changes visible to kernel developers include:
- The compile-time stack validation
patches have been merged, providing a tool that ensures that the
call stack will always be valid. The result will be more reliable
stack traces for developers; this feature is also needed for the
further development of the live-patching mechanism.
- There is a new function for freeing a set of objects:
void kfree_bulk(size_t size, void **objects);
It differs from kmem_cache_free_bulk() in that there is no pointer to a kmem_cache structure, meaning that objects from multiple slabs can be freed together. There is a cost to doing things this way, though, so kmem_cache_free_bulk() is preferred in cases where it is applicable.
- The cpufreq subsystem, charged with setting CPU frequencies to match
the current system load, has seen some significant changes. In
current kernels, it uses timers to periodically sample the load on the
CPU and, perhaps, make changes. As of 4.6, instead, the cpufreq
governors will be called directly from the scheduler when things
change, eliminating the timers. Eventually the governors will also
use the projected load information from the scheduler to make
(hopefully) better decisions, but that is work for a future
development cycle.
- sscanf() now has basic support for matching sets of
characters using the %[ operator (e.g. "%[abc]" to
match any of abc). Only literal sets can be
matched; there is, for example, no special meaning for "-"
within a character set.
- The new dtx_diff tool, in the scripts/dtc directory,
can calculate the differences between device trees in a number of
formats.
- The generic code supporting encrypted filesystems has been moved into
the VFS layer (in fs/crypto) so that it can be used beyond
the ext4 and f2fs filesystems.
- The I2C subsystem has a new pin-controller-based bus demultiplexor
allowing runtime selection between multiple I2C controllers. See i2c-demux-pinctrl.txt for an
overview.
- The "kcov" kernel code-coverage analyzer has been merged; it can be useful to ensure that fuzzing and other testing efforts have exercised as much code as possible. See Documentation/kcov.txt for more information.
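The literal-set rule described above for the new sscanf() %[ conversion can be modeled in a few lines. What follows is only a Python illustration of that matching rule, not the kernel's C implementation; the function name match_literal_set is hypothetical.

```python
def match_literal_set(text: str, charset: str) -> str:
    """Return the longest prefix of text made only of characters in charset.

    Every character in charset is taken literally, so "-" carries no
    range meaning: the set "a-c" contains just 'a', '-' and 'c'.
    """
    allowed = set(charset)
    end = 0
    while end < len(text) and text[end] in allowed:
        end += 1
    return text[:end]


# "%[abc]" style match: stops at the first character outside the set.
print(match_literal_set("abcabd", "abc"))
# With the set "a-c", the letter 'b' is NOT matched, but '-' is.
print(repr(match_literal_set("b-a", "a-c")))
print(match_literal_set("a-c-b", "a-c"))
```

The second call shows the practical consequence of having no range syntax: a caller that wants to match a through c must list all three letters.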
At this point, it would appear that the bulk of the changes for this development cycle have been merged. The merge window will likely stay open through March 27, though, so one never knows whether something else of interest might turn up. Next week's Kernel Page will summarize any significant changes that appear at the tail end of the 4.6 merge window.
A case for variant symlinks
Variant symlinks are symbolic links that behave differently depending on details of the process that reads or follows the link. They have a history going back at least to the 1980s when various vendors of Unix systems wanted to be compatible with both BSD Unix from UCB (The University of California at Berkeley), and System V Unix from AT&T. Details varied, but the core idea was that some attribute of a process could be used to modify the target of a symlink or to select among multiple options. This would allow, for example, some processes to see /bin as a symlink to /.ucbbin, while others would see /.attbin.
While those issues are long behind us, the desire for variant symlinks still pops up from time to time, most recently in a proposal by Cole Minnaar for a "Variant Symlink Filesystem". The proposed filesystem (currently implemented as an out-of-tree kernel module) takes an extremely simple approach to the problem. The filesystem provides a single directory that contains a single symlink called resolve. When any process reads or follows this link, the filesystem looks through that process's environment for a particular environment variable, specified when the filesystem is mounted, and reports the value of that variable as the content of the symlink.
To use this you would mount a filesystem at some well-known location and create links that pass through that location. For example:
# mount -t varsymfs -o UNIVERSE none /.universe
# ln -s /.universe/resolve/bin /bin
Then:
$ UNIVERSE=/att ls -lL /bin
would show the contents of /att/bin,
while:
$ UNIVERSE=/ucb ls -lL /bin
would show those of /ucb/bin.
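The lookup in the example above can be modeled in user space. The sketch below is a Python simulation of the path substitution only, not the proposed kernel module; the function name resolve_variant is hypothetical, and raising an error when the variable is unset is an assumption, since the proposal's behavior in that case is not described here.

```python
import posixpath


def resolve_variant(path: str, env: dict,
                    mount_point: str = "/.universe",
                    variable: str = "UNIVERSE") -> str:
    """Expand <mount_point>/resolve in path using the reader's environment."""
    link = posixpath.join(mount_point, "resolve")
    if path != link and not path.startswith(link + "/"):
        return path  # this path does not pass through the variant symlink
    target = env.get(variable)
    if target is None:
        # Assumption: an unset variable makes the link unresolvable.
        raise FileNotFoundError(f"{variable} is not set for this process")
    # The variable's value becomes the content of the "resolve" symlink.
    return target + path[len(link):]


# /bin is a symlink to /.universe/resolve/bin, as in the example above.
bin_target = "/.universe/resolve/bin"
print(resolve_variant(bin_target, {"UNIVERSE": "/att"}))
print(resolve_variant(bin_target, {"UNIVERSE": "/ucb"}))
```

The same link content yields a different target for each process, which is also why the result cannot be cached in the VFS layer.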
The responses to this proposal were pretty much as would be expected: Minnaar was told that he should use a FUSE filesystem written in user space, or use mount namespaces to give different processes a different view of the system. Minnaar made it clear that this wasn't just a new idea with no history, but was something he has been working on for some time. Both those ideas had been tried and found wanting.
The problem with a FUSE-based solution is performance. Though the special filesystem is not used for any filesystem I/O and is only needed to look up a single symbolic link, a performance decrease can still be measured. By its nature, a variant symbolic link cannot be cached in the VFS layer, so every request would need to go to user space and back into the kernel. Recent work has made symlink lookup largely lockless because, for some workloads, even requiring spinlocks for following a symlink can be too expensive. This was found to be particularly true when compiling code, since searching for include files generates lots of filename lookups. Minnaar identified compilation as a problematic case for FUSE-based variant symlinks too, and even the cost of that spinlock — sufficient to justify a rewrite of the symlink lookup code — is tiny compared to the cost of scheduling a user-space process to provide an answer.
The situation with mount namespaces brings its own set of problems, though of a very different kind. A large part of the focus on namespaces has been the creation of containers to contain processes — once in a container, the process shouldn't be able to get out. Minnaar is not interested in that side at all. He is interested in convenience rather than containment.
The example he sketched was to support multiple versions of packages that require the use of fixed paths. Many packages, such as Perl and Emacs, include a version number in the path names used for finding support files, such as /usr/lib/perl5/site_perl/5.22.1. This allows multiple versions to be installed side by side. Many other packages are not so enlightened, allowing only one version to be installed at a time. It would be possible to fix such packages to support parallel installations, but it seems it was easier to implement variant symlinks. That way each package can behave as though it owns the standard path names and each user can select their preferred package version by setting up some environment variables.
When it comes to convenience, filesystem namespaces have two problems: one that was raised in the discussion and one that, so far, has not been. The first problem is that the Unix shell doesn't have a "chns" command to change namespaces. While you can certainly use nsenter, as David Lang suggested, this creates a new shell rather than adjusting the state of the old shell. There is a good reason that cd or chdir is built into the shell: having it external would be nowhere near as convenient. In the same way, nsenter would only be as convenient as export UNIVERSE=/att if it were built in.
The second problem is the inevitable combinatorial explosion that namespaces would cause. If there is only a need to select on one axis, ucb or att, then namespaces could be made to work. If independently selecting between versions of a dozen packages is needed, then there would be a need for potentially thousands of namespaces, one for each combination. In practice, this explosion may not happen, but the need to construct namespaces on demand might not be the most convenient approach.
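The arithmetic behind "potentially thousands" is easy to check. The following sketch assumes, purely for illustration, a dozen packages with two installed versions each; the helper names are hypothetical.

```python
from math import prod


def namespaces_needed(versions_per_package):
    """One pre-built namespace per combination of package versions."""
    return prod(versions_per_package)


def variables_needed(versions_per_package):
    """With variant symlinks, one environment variable per package suffices."""
    return len(versions_per_package)


one_axis = [2]          # a single choice: ucb or att
dozen = [2] * 12        # assumed: twelve packages, two versions each

print(namespaces_needed(one_axis), variables_needed(one_axis))
print(namespaces_needed(dozen), variables_needed(dozen))
```

The namespace count grows as the product of the per-package choices, while the number of environment variables grows only with the number of packages.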
While variant symlinks may well be useful, it would help to have a variety of concrete use-cases to examine so that we could see exactly how they would be used and informed implementation choices could be made. It is easy to "bikeshed" some variations, like whether a constant prefix should be provided at mount time so the environment variable values don't need to start with "/". However, such bikeshedding is likely to focus on the inconsequential and miss the essential. What we need, as Al Viro indicated, is to ask "the right questions for figuring out what requirements" there are, so as to determine "the best way to do it". Whether anything like that occurs remains to be seen.
Understanding the new control groups API
After many years, the Linux kernel's control group (cgroup) infrastructure is undergoing a rewrite that makes changes to the API in a number of places. Understanding the changes is important to developers, particularly those working with containerization projects. This article will look at the new features of cgroups v2, which were recently declared production-ready in kernel 4.5. It is based on a talk I gave at the recent Netdev 1.1 conference in Seville, Spain. The video [YouTube] for that talk is now available online.
Background
The cgroup subsystem and associated controllers handle management and accounting of various system resources like CPU, memory, I/O, and more. Together with the Linux namespace subsystem, which is a bit older (having started around 2002) and is considered a bit more mature (apart, perhaps, from user namespaces, which still raise discussions), these subsystems form the basis of Linux containers. Currently, most projects involving Linux containers, like Docker, LXC, OpenVZ, Kubernetes, and others, are based on both of them.
The development of the Linux cgroup subsystem started in 2006 at Google, led primarily by Rohit Seth and Paul Menage. Initially the project was called "Process Containers", but later on the name was changed to "Control Groups", to avoid confusion with Linux containers, and nowadays everybody calls them "cgroups" for short.
There are currently 12 cgroup controllers in cgroups v1; all but one have existed for several years. The new addition is the PIDs controller, developed by Aleksa Sarai and merged in kernel 4.3. It allows restricting the number of processes created inside a control group, and it can be used as an anti-fork-bomb solution. The PID space in Linux consists of, at a maximum, about four million PIDs (PID_MAX_LIMIT). Given today's RAM capacities, this limit could easily and quite quickly be exhausted by a fork bomb from within a single container. The PIDs controller is supported by both cgroups v1 and cgroups v2.
Over the years, there was a lot of criticism about the implementation of cgroups, which seems to present a number of inconsistencies and a lot of chaos. For example, when creating subgroups (cgroups within cgroups), several cgroup controllers propagate parameters to their immediate subgroups, while other controllers do not. Or, for a different example, some controllers use interface files (such as the cpuset controller's clone_children) that appear in all controllers even though they only affect one.
As maintainer Tejun Heo himself has admitted [YouTube], "design followed implementation", "different decisions were taken for different controllers", and "sometimes too much flexibility causes a hindrance". In an LWN article from 2012, it was said that "control groups are one of those features that kernel developers love to hate."
Migration
The cgroups v2 interface was declared non-experimental in kernel 4.5. However, the cgroups v1 subsystem was not removed from the kernel, so, after the system boots, both cgroups v1 and cgroups v2 are enabled by default. You can use a mixture of both of them, although you cannot use the same type of controller in both cgroups v1 and in cgroups v2 at the same time.
It is worth mentioning that there is a patch that adds a kernel command-line option for disabling cgroups v1 controllers (cgroup_no_v1), which was merged for kernel 4.6.
Kernel support for cgroups v1 will probably still exist for at least several more years, as long as there are user-space applications that use it—quite like what we had in the past with iptables and ipchains, and what we observe now with iptables and nftables. Some user-space applications have already started migration to cgroups v2—for example, systemd and CGManager.
Both versions of cgroups are controlled by way of a synthetic filesystem that gets mounted by the user. During the last three years or so, a special mount option was available in cgroups v1 (__DEVEL__sane_behavior). This mount option enabled using certain experimental features, some of which formed the basis of cgroups v2 (the option was removed in kernel 4.5, however). For example, using this mount option forces the use of the unified hierarchy mode, in which controller management is handled similarly to how it is done in cgroups v2. The __DEVEL__sane_behavior mount option is mutually exclusive with the mount options that were removed in cgroups v2, like noprefix, clone_children, release_agent, and more.
Systemd started to use cgroups for service management rather than for resource management many years ago. Each systemd service is mapped to a separate control group. However, the migration of systemd to cgroups v2 is still partial, as it uses the __DEVEL__sane_behavior mount option. Also, in CGManager, current support for cgroups v2 is partial: it is available only when using Upstart, and not when using systemd.
Currently, three cgroup controllers are available in cgroups v2: I/O, memory, and PIDs. There are already patches and discussions in the cgroups mailing list about adding the CPU controller as well.
There are also interesting patches adding support for resource groups, posted just last week by Heo. In cgroups v1, you could assign threads of the same process to different cgroups, but this is not possible in cgroups v2. As a result, in-process resource-management abilities, like the ability to control CPU cycle distribution hierarchically between the threads of a process, are missing, as all of the threads belong to a single cgroup. With the suggested resource groups (rgroups) infrastructure, this ability can be implemented as a natural extension of the setpriority() system call.
Details of the cgroups v2 interface
Mounting cgroups v2 is done as follows:
mount -t cgroup2 none $MOUNT_POINT
Note that the type argument (following -t) specified has changed; cgroups v1 used -t cgroup. As in cgroups v1, the mount point can be anywhere in the filesystem. But, in contrast, there are no mount options at all in cgroups v2. One could use mount options to enable controllers in cgroups v1, but in cgroups v2 this is done differently, as we will see below. Creation of new subgroups in cgroups v2 is done with mkdir groupName, and removal is done with rmdir groupName.
After mounting cgroups v2, a cgroup root object is created, with three cgroup core interface files beneath it. For example, if cgroups v2 is mounted on /sys/fs/cgroup2, the following files are created under that directory:
- cgroup.controllers – This shows the supported cgroup controllers. All v2 controllers not bound to a v1 hierarchy are automatically bound to the v2 hierarchy, and show up in cgroup.controllers of the cgroup root object.
- cgroup.procs – When the cgroup filesystem is first mounted, cgroup.procs in the root cgroup contains the list of PIDs of all processes in the system, excluding zombie processes. For each newly created subgroup, cgroup.procs is empty, as no process is attached to the newly created group. Attaching a process to a subgroup is done by writing its PID into the subgroup's cgroup.procs.
- cgroup.subtree_control – This holds the controllers that are
enabled for the immediate subgroups. This entry is empty just after mount, as no controllers
are enabled by default. Enabling and disabling controllers in the
immediate subgroups of a parent is done only by writing into its
cgroup.subtree_control file. So, for example, enabling the memory
controller is done by:
echo "+memory" > /sys/fs/cgroup2/cgroup.subtree_control
and disabling it is done by:
echo "-memory" > /sys/fs/cgroup2/cgroup.subtree_control
You can enable/disable more than one controller in the same command line.
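For programs that manage controllers directly, the write shown above can be wrapped in a small helper. This is a minimal Python sketch; the function names are hypothetical, the caller supplies the cgroup directory, and the space-separated form used to combine several controllers in one write follows the cgroups v2 documentation rather than anything stated in this article.

```python
import os


def subtree_control_request(enable=(), disable=()):
    """Build the text to write into cgroup.subtree_control."""
    both = set(enable) & set(disable)
    if both:
        raise ValueError(f"cannot both enable and disable: {sorted(both)}")
    tokens = [f"+{name}" for name in enable]
    tokens += [f"-{name}" for name in disable]
    return " ".join(tokens)


def write_subtree_control(cgroup_dir, enable=(), disable=()):
    """Apply the request to the subtree_control file under cgroup_dir."""
    request = subtree_control_request(enable, disable)
    path = os.path.join(cgroup_dir, "cgroup.subtree_control")
    with open(path, "w") as control:
        control.write(request)


# Only build and show the requests here; nothing is written.
print(subtree_control_request(enable=["memory"]))
print(subtree_control_request(enable=["memory"], disable=["pids"]))
```

Calling write_subtree_control("/sys/fs/cgroup2", enable=["memory"]) would correspond to the first echo command, assuming the mount point used in the example.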
These three cgroup core interface files are also created for each newly created subgroup. Apart from these three files, a cgroup core interface file called cgroup.events is created. This interface file is unique to non-root subgroups.
The cgroup.events file reflects the number of processes attached to the subgroup, and consists of one item, "populated: value". The value is 0 when there are no processes attached to that subgroup or its descendants, and 1 when there are one or more processes attached to that subgroup or its descendants.
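A program watching a subgroup needs to turn the contents of cgroup.events into a value it can test. The sketch below is a hedged Python example that works on the file's text; the function names are hypothetical, and because the exact separator is an assumption, the parser accepts the key with or without a trailing colon.

```python
def parse_events(text: str) -> dict:
    """Parse "key value" lines; a trailing ":" on the key is tolerated."""
    events = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        key, value = fields
        events[key.rstrip(":")] = int(value)
    return events


def is_populated(text: str) -> bool:
    """True when the subgroup or one of its descendants has processes."""
    return parse_events(text).get("populated", 0) == 1


print(is_populated("populated 1\n"))
print(is_populated("populated: 0\n"))
```

In real use the text would come from reading the cgroup.events file of the subgroup being monitored.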
As mentioned, subgroup creation is similar to how it is done in cgroups v1. But in cgroups v2, you can only create subgroups in a single hierarchy, under the cgroups v2 mount point. When a new subgroup is created, the value of the "populated" entry in cgroup.events is 0, as you would expect, as there is no process yet attached to this newly created subgroup.
You can monitor events in this subgroup from user space by using poll(), inotify, or dnotify. Thus, you can be notified when cgroup.events changes, which can be used to determine when the last process attached to a subgroup terminates or when the first process is attached to that subgroup. This mechanism is much more efficient in terms of performance than the parallel mechanism in cgroups v1, the release agent.
It is worth mentioning that this notification mechanism can also be used by controller-specific interface files. For example, the cgroups v2 memory controller has an interface file called memory.events, which enables monitoring memory events like out-of-memory (OOM) in a similar way.
When a new subgroup is created, controller-specific files are created for each enabled controller in this subgroup. For example, when the PIDs controller is enabled, two interface files are created: pids.max and pids.current, for setting a limit on the number of processes forked in that subgroup, and for accounting of the number of processes in that subgroup.
Let's take a look at two diagrams illustrating what we just described. The following sequence mounts cgroups v2 on /cgroup2 and creates a subgroup called "group1", creates two subgroups of group1 ("nested1" and "nested2"), then enables the PIDs controller in group1:
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/group1
mkdir /cgroup2/group1/nested1
mkdir /cgroup2/group1/nested2
echo +pids > /cgroup2/cgroup.subtree_control
The following diagram illustrates the status after running this sequence. We can see that the two PIDs controller interface files, pids.max and pids.current, were created for group1.
Now, if we run:
echo +pids > /cgroup2/group1/cgroup.subtree_control
this will enable the PIDs controller in group1's immediate subgroups, nested1 and nested2. As a result, the PIDs-controller-specific files (pids.max and pids.current) are created in both of these subgroups. The extra step is needed because writing +pids into the subtree_control of the root cgroup enables the PIDs controller only in the root's direct child subgroups and not in any deeper descendants.
The subsequent diagram shows the status after enabling the PIDs controller on group1.
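The effect of the two command sequences above can be summarized with a small model. This Python sketch is a simplified simulation, not kernel code: a group gets a controller's interface files when that controller is listed in its parent's subtree_control. The function and variable names are hypothetical.

```python
import posixpath

# Interface files named in the text for the PIDs controller.
CONTROLLER_FILES = {"pids": ["pids.max", "pids.current"]}


def controller_files(group, subtree_control):
    """List controller files in group, given each group's subtree_control."""
    parent = posixpath.dirname(group)
    enabled = subtree_control.get(parent, set())
    return sorted(name
                  for controller in enabled
                  for name in CONTROLLER_FILES.get(controller, []))


# After: echo +pids > /cgroup2/cgroup.subtree_control
state = {"/cgroup2": {"pids"}}
print(controller_files("/cgroup2/group1", state))
print(controller_files("/cgroup2/group1/nested1", state))

# After: echo +pids > /cgroup2/group1/cgroup.subtree_control
state["/cgroup2/group1"] = {"pids"}
print(controller_files("/cgroup2/group1/nested1", state))
print(controller_files("/cgroup2/group1/nested2", state))
```

The model makes the one-level reach of subtree_control explicit: nested1 and nested2 gain their files only after group1's own subtree_control is written.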
The no-internal-process rule
Unlike in cgroups v1, in cgroups v2 you can attach processes only to leaves. This means that you cannot attach a process to an internal subgroup if it has any controller enabled. The reason behind this rule is that processes in a given subgroup competing for resources with threads attached to its parent group create significant implementation difficulties.
The following diagram illustrates this.
(Note: when you write 0 into cgroup.procs, this will write the PID of the process performing the writing into the file.)
The documentation discusses the no-internal-process rule in more detail.
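The no-internal-process rule can also be expressed as a toy model. The Python sketch below is an illustration under simplifying assumptions, not the kernel's logic: class and exception names are hypothetical, and the root group is treated as exempt because, as in the earlier example, it holds all processes while still enabling controllers for its children.

```python
class RuleViolation(Exception):
    """Raised when an operation would break the no-internal-process rule."""


class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subtree_control = set()  # controllers enabled for subgroups
        self.procs = set()            # PIDs attached directly to this group

    def enable(self, controller):
        # A non-root group holding processes may not become "internal".
        if self.parent is not None and self.procs:
            raise RuleViolation(f"{self.name} has attached processes")
        self.subtree_control.add(controller)

    def attach(self, pid):
        # A non-root group with enabled controllers may not take processes.
        if self.parent is not None and self.subtree_control:
            raise RuleViolation(f"{self.name} has controllers enabled")
        self.procs.add(pid)


root = Cgroup("root")
group1 = Cgroup("group1", parent=root)
nested1 = Cgroup("nested1", parent=group1)

root.attach(1)            # root is exempt in this model
root.enable("pids")
group1.enable("pids")     # group1 is now an internal group
nested1.attach(1234)      # leaves accept processes

try:
    group1.attach(5678)
except RuleViolation as error:
    print("refused:", error)
```

The model checks the rule in both directions, since a group could otherwise end up internal either by gaining a process or by gaining an enabled controller.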
In cgroups v1, a process can belong to many subgroups, if those subgroups are in different hierarchies with different controllers attached. But, because belonging to more than one subgroup made it difficult to disambiguate subgroup membership, in cgroups v2, a process can belong only to a single subgroup.
We will look at an example when this restriction is important. In cgroups v1, there are two network controllers: net_prio (written by Neil Horman) and net_cls (by Thomas Graf). These controllers were not extended to support cgroups v2. Instead, the xt_cgroup netfilter matching module was extended to support matching by a cgroup path. For example, the following iptables rule matches traffic that was generated by a socket created in a process attached to mygroup (or its descendants):
iptables -A OUTPUT -m cgroup --path mygroup -j LOG
Such a match is not possible in cgroups v1, because sometimes a process can belong to more than a single subgroup. In cgroups v2, this problem does not exist, because of the single-subgroup rule.
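The "mygroup (or its descendants)" semantics of the rule above can be sketched as a comparison of path components. This Python example is only a model of that matching idea, with a hypothetical function name; it is not how the netfilter module is implemented.

```python
def cgroup_path_matches(process_cgroup: str, rule_path: str) -> bool:
    """True if process_cgroup is rule_path itself or one of its descendants."""
    process_parts = [part for part in process_cgroup.split("/") if part]
    rule_parts = [part for part in rule_path.split("/") if part]
    # Compare whole components so that "mygroup2" does not match "mygroup".
    return process_parts[:len(rule_parts)] == rule_parts


print(cgroup_path_matches("mygroup", "mygroup"))
print(cgroup_path_matches("mygroup/nested1", "mygroup"))
print(cgroup_path_matches("mygroup2", "mygroup"))
print(cgroup_path_matches("othergroup/mygroup", "mygroup"))
```

Because a process belongs to exactly one subgroup in cgroups v2, a single path is enough to decide the match; in cgroups v1 the same process could sit at a different path in each hierarchy.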
Summary
Work is ongoing; in addition to the resource-group patches mentioned earlier, there are patches for a new RDMA cgroup controller that are currently in the pipeline. This patch set allows resource accounting and limit enforcement on a per-cgroup, per-RDMA-device basis. These patches are in the post-RFC phase, and are in the ninth iteration as of this writing; it seems likely that they will be merged soon.
As we have seen, the new interface of cgroups v2, which was recently declared stable in the kernel, has several advantages over cgroups v1, such as its notification-to-user-space mechanism. Although the cgroups v2 implementation is still in its initial stages, it seems to be much better organized and more consistent than cgroups v1.
Page editor: Jonathan Corbet