Kernel development
Brief items
Kernel release status
The 4.7 merge window remains open as of this writing. See the article below for a summary of the important changes that have been merged for the 4.7 release.Stable updates: 4.5.5, 4.4.11, and 3.14.70 were released on May 18.
A small waitid() change
System-call fuzzing work recently turned up an interesting problem on many Linux systems: depending on which init system is used, a process can create permanent zombie children by cloning a thread, calling ptrace() on it, then exiting. If the init system is not waiting for exiting processes in the right way, that particular thread will go unnoticed and unreaped; unlike other zombies, it will not roam the system searching for a cerebral-matter snack, but it will sit there forever using memory, which is bad enough.This is, technically, not a kernel bug: a call to waitid() (which is what most init implementations evidently use) is not supposed to wait for children in the absence of the __WALL flag. But it is a denial-of-service vector, changing every init implementation is impractical, and, besides, the waitid() system call, unlike wait4(), does not even accept the __WALL flag. So Oleg Nesterov decided that this problem should be fixed in the kernel, even if the bug could be said to be elsewhere.
So, as of 4.7, a call to waitid() will wait for child processes
running under ptrace(). It will also accept the __WALL
flag, but that flag will not actually be necessary to prevent this
particular zombie invasion. As Oleg noted in the
patch changelog, this is an ABI change: "__WCLONE and
__WALL no longer have any meaning for debugger. And I can only
hope that this won't break something, but at least strace/gdb won't
suffer.
" The change seems unlikely to cause problems further
afield, but one never knows; if problems are found during the 4.7
development cycle, this change may have to come back out.
Kernel development news
4.7 Merge window, part 2
As of this writing, Linus has pulled almost 9,900 non-merge changesets into the mainline repository for the 4.7 kernel; that includes some 6,500 since last week's summary was written. After the near-record volume of changes that went into 4.6, the community has slowed down a little — but only a little.Some of the more interesting, user-visible changes pulled this time around include:
- The tracing subsystem has gained support for histogram triggers, which can perform
some types of statistical accumulation in the kernel. This
commit contains the documentation additions.
- The event-filtering code for the tracing subsystem has long been able
to follow a specific list of process IDs; in 4.7, there is a new
event-fork option that will cause newly-created child
processes to be automatically added to the list.
- The LoadPin security module has been
merged. If this module is enabled (not the default), all data loaded
into the kernel (modules, firmware, etc.) must come from a single
trusted device.
- The MIPS architecture now supports kernel address-space layout
randomization.
- The PCI Express "downstream port containment" (DPC) feature is now
supported. DPC allows the containment of uncorrectable errors in
hardware attached via a specific port.
- There is a new option to randomize the ordering of the free lists in
the slab memory allocator; the hope is that more unpredictability will
make attacks harder.
- The out-of-memory detection patch set
has been merged. These patches change how the kernel decides that the
system is out of memory with an eye toward creating more deterministic
and reliable behavior.
- A process's current umask can now be read from a new field in
/proc/PID/status.
- The "device DAX" mechanism allows persistent memory to be presented as
a character device (/dev/dax.X.Y) rather than system memory.
This memory can then be
accessed (and mapped into user space) without the need to place a
filesystem on it.
- New hardware support includes:
- Systems and processors:
ARM V2M-MPS2 Cortex-M prototyping systems,
Oxford Semiconductor OXNAS Family systems-on-chip (SoCs),
ASpeed baseboard management controller SoCs,
LG Electronics LG1K SoCs,
EZchip NPS-based systems, and
Loongson-3A R2 MIPS CPUs.
See also Arnd
Bergmann's description of the new ARM systems for more
information, including the fact that the ASpeed submission was
evidently motivated by an LWN article.
- Block:
Shingled magnetic recording devices using the Zone ATA command
mechanism.
- Graphics:
Analogix ANX78XX video bridges,
ARC PGU display controllers,
Allwinner A10 display engines,
Hisilicon Kirin series frame buffers, and
Mediatek MT8173 display subsystems. See also Daniel
Vetter's summary for a definitive list of improvements to the
Intel graphics drivers in this cycle.
- Industrial I/O:
NXP LPC18xx analog-to-digital and digital-to-analog converter
(ADC/DAC) controllers,
Analog Devices AD5592R/AD5593R ADC/DACs,
Microchip MCP4xxx potentiometers,
HOPERF HP206C barometer/altimeters,
Maxim DS1803 digital potentiometers,
Maxim MAX44000 ambient and infrared proximity sensors,
Bosch BMI160 inertial measurement units,
ROHM BH1780 ambient light sensors,
Vishay VEML6070 UV-A light sensors,
HopeRF HP03 digital pressure/temperature sensors, and
Aosong AM2315 relative humidity and temperature sensors.
- Miscellaneous:
Samsung Exynos SROM memory controllers,
NVIDIA Tegra XUSB pad controllers,
NVIDIA Tegra xHCI host controllers,
NVIDIA Tegra210 ADMA controllers,
Oxford Semiconductor reset controllers,
Intersil/Techwell TW686x-based frame grabber cards,
Microchip PIC32 serial ports,
Microchip PIC32 hardware watchdogs and deadman timers,
Intel Broxton digital signal processors,
Marvell Armada-8K PCIe controllers,
Maxim Semiconductor MAX77620 and MAX20024 power-management ICs,
HiSilicon Hi655X power-management ICs,
Atmel AT91 SAMA5D2-compatible shutdown controllers,
HiSilicon reset controllers,
ARM MPS2 UART controllers,
CoreSight system trace macrocells,
Microchip PIC32 series SPI controllers, and
Renesas watchdog timer controllers.
- Pin control:
Intel Baytrail pin controllers,
Marvell PXA25x pin controllers, and
Broadcom Northstar2 pin controllers.
- USB: USB Type-C connector system software interfaces and Broadcom Northstar USB 2.0 PHYs.
- Systems and processors:
ARM V2M-MPS2 Cortex-M prototyping systems,
Oxford Semiconductor OXNAS Family systems-on-chip (SoCs),
ASpeed baseboard management controller SoCs,
LG Electronics LG1K SoCs,
EZchip NPS-based systems, and
Loongson-3A R2 MIPS CPUs.
See also Arnd
Bergmann's description of the new ARM systems for more
information, including the fact that the ASpeed submission was
evidently motivated by an LWN article.
Changes visible to kernel developers include:
- The "SG pool" code, providing helpers for the allocation of chained
scatter/gather lists, has been moved out of the SCSI code and made
available to the rest of the kernel. No documentation exists, but the
interface can be seen in lib/sg_pool.c.
- The pin control subsystem now offers devm_pinctrl_register(),
allowing drivers to drop a lot of cleanup code.
- The KASan memory debugging tool will
now "quarantine" freed memory, taking it out of use for some time.
The idea is that isolating freed memory in this way will improve the
detection of use-after-free errors. KASan has also gained the ability
to monitor accesses to user-space memory.
- The multi-order radix tree patches have been merged, allowing the radix tree to track address ranges greater than a single page.
At this point, the patch flow into the mainline has slowed considerably; just about all of the major trees have been pulled. The merge window has a few more days to run, though; come back next week for a closing summary for this development cycle.
In search of the right RGB LED interface
One of the roles of the Linux kernel is to provide uniform, abstract interfaces to varying hardware. When a new class of hardware comes along, it can take a while to understand what the best interface would be. This has been seen in recent months with the appearance of nonvolatile memory in large quantities leading to disagreements over the semantics of DAX filesystem access and the handling of hardware errors. The same basic question has arisen, though in a much smaller way, over the best handling of RGB LEDs — triplets of LEDs, each of a different color, which together can produce a wide range of colors and intensities.
Linux already has support for monochrome LEDs, including minimal support for identifying the color of each LED: the name of the LED can, and sometimes does, include the English name of the LED's color (locomo:green:mail, for example). The simplest approach to managing RGB LEDs is to treat them as three independent LEDs with related names. User-space tools can then follow simple conventions to find related LEDs and create interesting colors as required.
There are two reasons to think this may not be the best long-term solution. The first involves integration with the various "triggers" that Linux supports for LEDs. As Jacek Anaszewski from Samsung explains, there are two classes of source information for triggers. One class has the trigger local to the LED, such as "timer" or "oneshot". These triggers are controlled from user space; programming three triggers in concert might be a little clumsy, but it still allows the full functionality to be used.
The other class of source information is from in-kernel events: CPU load, disk drive activity, network device activity, etc. These currently only adjust the brightness or the duty-cycle of the LEDs, but a natural enhancement would be to allow them to adjust color. That would require the kernel to know how specific LEDs work together to produce different colors. A particular example is the heartbeat trigger. On monochrome LEDs this trigger produces a "thump-thump-pause" pattern designed to mimic the human heartbeat with a rate that increases as the load-average on the system increases. Heiner Kallweit has implemented an alternate heartbeat that works with RGB LEDs and uses the color (ranging from green to red) rather than the rate to represent load.
It is easy to imagine other ways that color information could be used to represent such things as acceptable or worrisome activity from various parts of the kernel. Supporting direct connections from those subsystems to a suitable RGB LED may provide a lot of value.
The second reason that the kernel might benefit from an explicit understanding that three LEDs work together is that this understanding is embedded in some hardware. A good example is the LP5523 LED controller [PDF] from TI that can drive up to nine LEDs. This controller is programmable with three separate engines and space to store 96 16-bit instructions. The instructions are general enough to be usable for computing prime numbers. The three engines naturally align with three sets of RGB LEDs, so allowing the kernel interface to represent these triples is likely to make for a better interface. Even when the LEDs are only accessed from user space, it would be helpful if high-level program requests, such as blink rates or brightness transitions, could be described for the three together so they can reliably be synchronized.
As yet there does not seem to be a clear vision for how generic RGB support might work. Kallweit posted some patches back in March but they have some problems. The basic approach is to present the three LEDs as a single LED device that changes all three colors at the same time, so it can be used as though it were a single white LED. The "brightness" value can be given hue and saturation components as well; this allows color to be changed from user space. This triplet of values is encoded in a single sysfs attribute which, as Pavel Machek highlighted, is not generally seen as acceptable.
One argument against this approach is that there are already devices with tri-color LEDs, such as the Nokia N900 and the motion controller for the Sony PlayStation. These currently use three separate LED devices and they need to be able to continue to work the same way when new functionality is added.
Using HSV (Hue Saturation Value) has some appeal as it includes the current brightness as a subset but, for correct mapping to RGB, a "gamma" value needs to be included, and the kernel may not be the best place to be adding that sort of complexity.
After some discussion, Anaszewski came up with a proposal that could make triggers like the color-based load indicator work with individual red, green, and blue LEDs. A single trigger can already apply to multiple LEDs, so the first step would be to assign that colorful "heartbeat" to each of the three LEDs. Then a new sysfs attribute would be used to configure each one to only display the "red", "green", or "blue" component of the signal. While this feels a little clumsy, it would certainly work and is simple to implement and to understand, which are more important considerations.
This doesn't really address the need to be able to program controllers that expect LEDs to be related rather than completely independent. Machek has some ideas on how that might be approached. There isn't a lot of detail; it essentially involves creating a new "pattern" device in sysfs that represents the capabilities for the engine in the controller. It can be configured and then linked to one or more LED devices. This model seems flexible enough to be able to support both software and hardware pattern generators, but without more details (and code) it is hard to judge it fairly.
Keeping the individual LEDs separate, but allowing them to be combined for pattern generation, seems to be a fairly accurate model of how the hardware works, as there is nothing in the hardware controller that forces the LEDs to be mounted close to each other physically. This match between model and reality bodes well for the design being one that could be successful.
While little details like RGB LEDs might not get as much attention as big-ticket items like massive nonvolatile RAM arrays, they are still quite important as they are exactly the sort of thing we can expect to see more of in the mobile-device space. If we ever want these devices to run mainline kernels, we would do well to work on getting support of these devices into mainline first. Or at least a close second.
A multi-order radix tree
Radix trees have been a part of Linux for quite some time; an LWN article from a decade ago explained the structure and functionality of them. The radix tree is a general-purpose data structure that is used by many different components within the kernel. It provides an efficient way to create a key-value store, where the key is an unsigned long, referred to as the index, and the value is a void *. The radix tree also stores a few bits of additional information for each entry in the form of tags.
The most common use of the radix tree is to keep track of a collection of pages. In struct address_space, for example, there is a radix tree called page_tree that tracks the in-memory page-cache pages that are associated with a given inode. The key for page_tree is the page offset (pgoff_t) into the file. For normal files, page_tree will map that key to a void * value which is actually the struct page * for the page-cache page at that file offset. For page-cache pages, the radix-tree tags let us track entries that are dirty and which are marked for writeback.
For inodes that take advantage of the DAX ("direct access") feature, there is no page cache sitting between the user processes and the storage. Hence, for DAX inodes there is no need to keep track of struct page * entries via the page_tree. Instead, for DAX inodes, this same page_tree is used to hold DAX exceptional entries that track the state of the persistent-memory pages used by DAX. On x86_64, these exceptional entries are 64-bit values that store several pieces of information, such as the page size (more on that below), the sector offset within the persistent-memory storage, and some flags. From these exceptional entries, DAX knows which dirty pages need to be flushed from the processor cache when an fsync() is received from user space.
For radix-tree uses where there is a one-to-one mapping between keys and values, such as a page_tree that only tracks PAGE_SIZE page-cache entries and/or DAX entries, this all works perfectly. But what about cases where this one-to-one relationship breaks down?
One example of this breakdown is huge pages. On x86_64, regular pages are 4KiB in size. Linux x86_64 also supports "huge pages" that are 2MiB in size, and the Linux DAX code has explicit support for these 2MiB pages. For the page_tree radix tree, this means that the one-to-one relationship between keys and values may not be sufficient.
There is a desire for a 2MiB page to be tracked as a single entity. There would be a single pointer to the 2MiB worth of data and the tag state would be consistent, so that the kernel can reliably track whether the data is dirty.
Existing solutions
As of kernel 4.5, DAX successfully tracks the value and state of 2MiB pages through the use of DAX exceptional radix-tree entries that reserve a few bits to record whether the DAX entry represents a 4KiB page (RADIX_DAX_PTE) or a 2MiB page (RADIX_DAX_PMD). 2MiB page entries (referred to as "Page Middle Directory" or PMD entries) are always inserted at a 2MiB boundary, so DAX is able to support huge-page entries with the following logic:
pgoff_t pmd_index = DAX_PMD_INDEX(index);
entry = radix_tree_lookup(page_tree, pmd_index);
if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
/* operate on the 2MiB 'entry' at 'pmd_index' */
else
entry = radix_tree_lookup(page_tree, index);
/* operate on the 4KiB 'entry' at 'index' */
This has the obvious cost that for every radix-tree operation there has to be an extra lookup for the entry using pmd_index to first check whether the index is covered by a 2MiB page. This solution is correct, but is somewhat costly in the RADIX_DAX_PTE case where we have to do a radix_tree_lookup() at both pmd_index and index. Having special-case lookups at multiple offsets also does not make for the cleanest code. When 1GiB page support is added to the DAX code, this solution begins to look even worse, because there will have to be yet another special-case lookup.
Another possible alternative that would not involve an extra lookup would be to represent the 2MiB entry via 512 redundant entries, each at a unique index. This would have the property that any lookup for the indices in the 2MiB range would return a copy of the data pointer, but it has the downside that the tag tracking is no longer consistent among the 512 entries. This would mean that some of the 512 entries could be tagged as clean and some of them could be tagged as dirty, even though they all had the same data pointer as their value.
There would also be a need to be sure to replicate other operations, such as removal, among all 512 entries. This solution has the additional downside that representing a 2MiB page via 512 individual entries adds many extra nodes to the radix tree.
Multi-order radix-tree techniques
Both of the solutions mentioned thus far for dealing with huge pages have left the radix-tree API unchanged. Ideally, there would be a solution where the one-to-one mapping between keys and values in the radix tree can be broken. That would allow inserting an entry that covers multiple 4KiB-sized indices and have operations on indices in that range, such as lookup, tag modification, and removal, all act on the same radix-tree entry. Recently there have been several patch series (1, 2, and 3) posted to the Linux kernel mailing list that set out to accomplish this goal via a solution that is called "the multi-order radix tree".
The basic idea is to add the ability to insert entries that span 2X 4KiB-sized page indices. X is referred to as the page's "order". Using this terminology, existing radix trees, in which every entry is associated with a single index, are composed entirely of order-0 entries. An order-2 entry would cover 22 = 4 adjacent indices. The 2MiB entries for the DAX huge-page example would be order-9 entries, and so on.
This new functionality is implemented in the multi-order radix tree using a
pair of techniques: sibling entries and elevated entries.
The smallest multi-order entry is an order-1 entry that covers two adjacent
indices. This is implemented by inserting a special "sibling pointer" for the
second index that points back to the actual radix-tree entry.
In this case, lookups, removals, and tag operations on both the base index and on the index for the sibling operate on the same actual entry in a way that is transparent to the user. For orders greater than 1, there can simply be multiple sibling entries that all point back to the actual radix-tree entry:
With a multi-level radix tree, there can be up to three different types of pointers. The first are internal pointers, which point from a parent radix-tree node to a child radix-tree node. The second are sibling pointers, which point from one entry in a given radix-tree node to another entry in that same node. The third are the void * data pointers that are stored as part of the key-value store.
The lowest bit of the radix-tree entry, RADIX_TREE_INTERNAL_NODE, is used to distinguish between the void * data pointers and the two types of pointers internal to the radix-tree implementation. Sibling pointers and internal node-to-node pointers are distinguished by looking at the value of the pointer itself. If the pointer points within the same node's slots array, it is a sibling pointer. If not, it points to a child radix-tree node.
If the fan-out of the radix tree happens to match the order of the multi-order entry, it can be represented using an elevated entry that simply lives as a child of one of the intermediate nodes in the tree:
Elevated multi-order entries can be the children of intermediate nodes at any level in the tree. Combining these two techniques allows us to have elevated multi-order entries with sibling pointers:
Having both sibling entries and elevated entries allows the radix tree to support multi-order entries of any order.
Radix-tree API changes
To use this new functionality, the multi-order radix tree has a few small API changes where an order parameter was needed.
int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
unsigned order, struct radix_tree_node **nodep,
void ***slotp);
int __radix_tree_insert(struct radix_tree_root *, unsigned long index,
unsigned order, void *);
__radix_tree_create() is only used in one place in mm/filemap.c, and __radix_tree_insert() is a new API added by the multi-order patches. radix_tree_insert(), the old insertion API that is used by all existing code, is now defined to be:
static inline int radix_tree_insert(struct radix_tree_root *root,
unsigned long index, void *entry)
{
return __radix_tree_insert(root, index, 0, entry);
}
The API for operations such as node lookup, deletion, tag manipulation, etc. remain unchanged. This has allowed the multi-order radix tree to be implemented with very little change to existing radix-tree users.
Integration with DAX 2MiB support
I recently posted a patch that integrates the DAX code with the new multi-order radix-tree code. As can be seen from that patch, the changes needed to move from the old method for supporting 2MiB pages to the new multi-order radix-tree support are quite small.
We now insert an order-9 entry when we need to track the status of a 2MiB huge page. This is done as follows:
error = __radix_tree_insert(page_tree, index,
RADIX_DAX_ORDER(pmd_entry),
RADIX_DAX_ENTRY(sector, pmd_entry));
RADIX_DAX_ORDER() gives us an order of 0 for 4KiB pages and an order of 9 for 2MiB pages.
For the rest of the radix-tree operations, like lookup and tag manipulation, there is no longer a need to first check for a 2MiB PMD entry as a special case. It just operates on the radix tree using the index and the radix tree will do the right thing if that index happens to map to a multi-order entry.
One last thing worth noting is that the multi-order radix tree API currently does not define a way for the user to query the order of a given entry. It is not immediately obvious whether this API is actually needed. The DAX code can infer the order of a given entry by looking at its type: RADIX_DAX_PTE or RADIX_DAX_PMD. When multi-order struct page entries start being inserted, their size can most likely be understood by looking at the page flags. However, if an API to query an entry's order proves useful, it could easily be added.
In conclusion
The multi-order radix tree patches have been present in both Andrew Morton's -mm tree as well as Stephen Rothwell's linux-next tree for several weeks; they were pushed upstream during the 4.7 merge window. The integration between DAX PMD entries and the new multi-order radix-tree code will be merged in 4.8 or later, and will need to take into account the recent DAX page-fault-locking patch series from Jan Kara. The combination of the multi-order radix-tree patches and the locking changes will allow DAX to have locks on a per-page basis, regardless of the size of the page.DAX will most likely be the first user of the new multi-order capability of the radix tree, but these changes should be interesting to anyone who deals with multiple page sizes. The transparent-huge-page code could probably make use of this new functionality, and it is likely that other users will spring up over time.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page:
Distributions>>
