Kernel development
Brief items
Kernel release status
The current development kernel is 4.10-rc2, released on January 1. This prepatch contains only 27 patches merged since 4.10-rc1 was released on December 25.
Stable updates: none have been released since December 15. The 4.9.1, 4.8.16 and 4.4.40 stable updates are in the review process as of this writing; they can be expected on or after January 6.
Finishing out the 4.10 merge window
As expected, the 4.10 merge window ended on December 25 with the 4.10-rc1 release. In the end, 11,455 non-merge changesets were pulled into the mainline repository for 4.10, making this a reasonably busy development cycle, even if it falls far short of 4.9. Fewer than 400 of those changes were pulled after the December 22 summary was written, so the list of additional changes is short.
That list includes:
- The PA-RISC architecture has gained support for kernel address-space layout randomization.
- The cache-allocation technology patch set has been merged. These patches provide access to a new mechanism in Intel processors by which the processor's memory cache can be partitioned between processes. It can be used to keep one group of processes (a container, say) from dominating the cache, or to set aside a portion of the cache for a set of privileged processes.
- There are new drivers for QLogic QEDI 25/40/100Gb iSCSI initiators and Loongson1 SoC hardware watchdogs.
- The cycle_t type used for clock values inside the kernel has been removed; a plain u64 type is now used instead.
The 4.10 stabilization period got off to a slow start due to the holidays; only 27 non-merge changesets were applied between 4.10-rc1 and 4.10-rc2. The pace of change can be expected to pick up, though, as developers return to work and the final 4.10 release date (probably February 12 or 19) approaches.
Kernel development news
Tracking functional dependencies between devices
Computing systems have grown significantly in complexity since the Linux kernel was first written. In response, the kernel has developed new mechanisms for managing device complexity, including the driver model, dynamic number assignment, and more. These mechanisms have solved a number of problems but, while the problem of managing runtime dependencies between seemingly independent devices has been understood for some time, it didn't get a proper solution until the 4.10 merge window.
Some device dependencies are inherent in the architecture of the system. For example, a USB peripheral will not be usable if the appropriate USB host adapter is unavailable, and that adapter is probably connected to some other system bus that must also be up and running. Dependencies based on the connection topology of the system are relatively easily represented in a tree structure; that is what the kernel's device model was created to do. Using this model, the kernel can, for example, suspend devices in the system in the correct order, keeping intermediate devices operational until all the devices that depend on them have been shut down.
In a modern system, though, the dependency graph can be rather more complicated. A camera "device", for example, is likely to be a set of interconnected devices that look independent to the kernel. Actually operating the camera requires a sensor device, which is probably controlled via a connection to an I2C bus; it probably also depends on a couple of GPIO devices for its power and reset lines. The sensor is connected to a separate bridge device that acquires the image data; that bridge may need a DMA controller to move that data into memory. There may be other devices for various hardware-implemented image transformations (rotation or color-space conversion, for example) in the mix as well.
The point is that each of these components looks like a separate device to the kernel. These devices are on separate controllers and, perhaps, on separate buses; they do not appear to be related from a look at the topology of the system. For the most part, a top-level driver marshals these devices and makes them function together; the information it needs to do this task is, in current systems, often found in the device tree structure. But the kernel's driver core can break things if it shuts down one of the subdevices because it doesn't understand that other devices depend on that subdevice.
Drivers have tended to work around this problem with one-off device-specific code. As one might expect, that leads to a fair amount of code duplication and a lot of inadequate solutions. It would be better to have a single solution in the driver core code that works for all devices. Moving toward that solution is the objective of the functional dependencies infrastructure merged for the 4.10 kernel.
The interface to this mechanism is relatively simple, consisting of a single function to indicate that a dependency exists:
struct device_link *device_link_add(struct device *consumer,
                                    struct device *supplier,
                                    u32 flags);
This call informs the driver core that the consumer device depends on the supplier device. So, for example, the system will not suspend supplier unless consumer is already suspended, and it will not probe or resume consumer until supplier is up and functional. Additionally, if the supplier device is unbound, the consumer device will, by virtue of no longer being able to function anyway, be unbound automatically.
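The ordering rule described here can be illustrated with a small user-space model. This is only a sketch, not the driver core's implementation: every name in it (toy_device, toy_link_add(), toy_try_suspend(), toy_try_resume()) is invented for illustration, and the real kernel orders its suspend and resume lists rather than refusing individual requests as this toy does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 8

/* Hypothetical stand-in for struct device; not the kernel's type. */
struct toy_device {
    const char *name;
    bool suspended;
};

/* Hypothetical stand-in for struct device_link. */
struct toy_link {
    struct toy_device *consumer;
    struct toy_device *supplier;
};

static struct toy_link links[MAX_LINKS];
static size_t nr_links;

/* Record that consumer depends on supplier; returns NULL if the table is full. */
static struct toy_link *toy_link_add(struct toy_device *consumer,
                                     struct toy_device *supplier)
{
    assert(consumer != supplier);
    if (nr_links == MAX_LINKS)
        return NULL;
    links[nr_links].consumer = consumer;
    links[nr_links].supplier = supplier;
    return &links[nr_links++];
}

/* A supplier may suspend only after all of its consumers are suspended. */
static bool toy_try_suspend(struct toy_device *dev)
{
    for (size_t i = 0; i < nr_links; i++)
        if (links[i].supplier == dev && !links[i].consumer->suspended)
            return false;
    dev->suspended = true;
    return true;
}

/* A consumer may resume only after all of its suppliers are running. */
static bool toy_try_resume(struct toy_device *dev)
{
    for (size_t i = 0; i < nr_links; i++)
        if (links[i].consumer == dev && links[i].supplier->suspended)
            return false;
    dev->suspended = false;
    return true;
}
```

With a link from a "bridge" consumer to a "dma" supplier, toy_try_suspend() on the DMA controller fails until the bridge has been suspended, and toy_try_resume() on the bridge fails until the DMA controller is running again.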
By default, device links are persistent and will remain in place for as long as the system is running. If, however, the DL_FLAG_AUTOREMOVE flag is provided when the link is created, that link will be automatically removed if the driver of the consumer device is unbound. These non-persistent links can be useful in situations where the hardware can be configured in multiple ways, creating varying dependencies over time. The DL_FLAG_STATELESS flag can be used to create a link that is used for suspend/resume ordering, but which is not otherwise managed by the driver core.
If there is a need to explicitly remove a device link, that can be done with a call to device_link_del():
void device_link_del(struct device_link *link);
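The lifetime rules for links can be modeled in user space as well. The sketch below is an assumption-laden toy rather than kernel code: TOY_FLAG_AUTOREMOVE merely stands in for DL_FLAG_AUTOREMOVE, and toy_device, toy_link_add(), toy_link_del(), toy_unbind(), and toy_link_count() are all hypothetical names. It shows a persistent link surviving an unbind, an auto-remove link vanishing when its consumer is unbound, and consumers being unbound along with their supplier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FLAG_AUTOREMOVE 0x1u   /* stand-in for DL_FLAG_AUTOREMOVE */
#define MAX_LINKS 8

/* Hypothetical stand-in for struct device. */
struct toy_device {
    const char *name;
    bool bound;                    /* true while a driver is bound */
};

struct toy_link {
    struct toy_device *consumer;
    struct toy_device *supplier;
    unsigned int flags;
    bool in_use;
};

static struct toy_link links[MAX_LINKS];

static struct toy_link *toy_link_add(struct toy_device *consumer,
                                     struct toy_device *supplier,
                                     unsigned int flags)
{
    assert(consumer != supplier);
    for (size_t i = 0; i < MAX_LINKS; i++) {
        if (!links[i].in_use) {
            links[i] = (struct toy_link){ consumer, supplier, flags, true };
            return &links[i];
        }
    }
    return NULL;                   /* table full */
}

/* Explicit removal, in the spirit of device_link_del(). */
static void toy_link_del(struct toy_link *link)
{
    link->in_use = false;
}

/* Unbind a driver: dependent consumers follow, auto-remove links go away. */
static void toy_unbind(struct toy_device *dev)
{
    if (!dev->bound)
        return;
    dev->bound = false;
    for (size_t i = 0; i < MAX_LINKS; i++) {
        struct toy_link *link = &links[i];

        if (!link->in_use)
            continue;
        if (link->supplier == dev)
            toy_unbind(link->consumer);   /* cannot work without supplier */
        else if (link->consumer == dev &&
                 (link->flags & TOY_FLAG_AUTOREMOVE))
            toy_link_del(link);
    }
}

static size_t toy_link_count(void)
{
    size_t n = 0;

    for (size_t i = 0; i < MAX_LINKS; i++)
        n += links[i].in_use;
    return n;
}
```

In this model, a link created with a flags value of zero stays in the table until toy_link_del() is called on it, no matter how many times the devices at either end are unbound.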
As of 4.10-rc2, there is only one user of this new infrastructure (the Exynos I/O memory-management unit code) in the mainline kernel. There will certainly be others that will show up in future development cycles, though. With luck, they will be accompanied by a reduction in driver-specific dependency-handling code and an improvement in kernel quality overall.
Context information in memory-allocation requests
As is the case with many other tasks, allocation of memory in the kernel is rather more complex than it is in user space. The APIs used for allocation in the kernel have evolved over many years to reflect this complexity. But a highly evolved API is not necessarily an optimal one, and there have been signs for years that the approach used in the kernel is not perfectly suited to the task. A set of patches under consideration now shows how thinking has shifted in this area.
Memory-allocation complexity in the kernel is driven by constraints on what the kernel can do in any given situation. It is often the case, for example, that the kernel is running in a context where it is not allowed to block waiting for an event, so allocation requests must be satisfied without acquiring any sleeping locks. Sometimes a request should be given access to the last-resort pools of memory; this is usually the case when the request itself is part of the process of freeing more memory in a system where memory is tight. There can be constraints on where the allocated memory must be located. And so on.
The approach taken to track these constraints is to add a "GFP flags" argument to every memory-allocation function. So, for example, the prototype of kmalloc(), used to allocate relatively small chunks of memory, is:
void *kmalloc(size_t size, gfp_t flags);
The flags argument describes the constraints on the request. A value of GFP_ATOMIC indicates that the request is running in atomic context and cannot sleep, for example, while GFP_DMA32 says that the allocated memory must be placed in a physical location reachable by devices limited to 32-bit DMA operations. There is a long list of these flags; <linux/gfp.h> has the whole set.
Two types of flags
The point of interest here is that some of these flags (like GFP_DMA32) describe attributes of the needed memory — they apply to a specific allocation request. But others, like GFP_ATOMIC, instead describe the context in which the allocation request is being made. This context is often not known at the point where the allocation function is called, since that often happens in low-level code that can be invoked in many contexts. So higher-level code must inform the low-level code about the context in which it is running; this is generally done by adding GFP-flags arguments to functions all the way up the call chain. To pick a random example, consider the function that submits a request to a USB device:
int usb_submit_urb(struct urb *urb, gfp_t mem_flags);
This relatively high-level function must track the given mem_flags and pass them to any function it calls that might allocate memory; it must also adjust the flags if its own context changes. This interface has been made to work for many years, but it is somewhat prone to errors. One could argue, as some have over the years, that it is fundamentally wrong; information tracking the context in which a thread is running might be better attached to the thread directly rather than dragged along through a chain of function calls.
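The flag-threading pattern can be seen in miniature in the following user-space sketch. Nothing here is a real kernel interface: the toy_gfp_t type, the TOY_GFP_* values, and the toy_alloc(), toy_alloc_buffer(), and toy_submit() functions are all invented for illustration, and the flag values bear no relation to the real ones in <linux/gfp.h>. The point is only that the context bit must be handed down by every caller, while the attribute bit is added by the code that knows what kind of memory it needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int toy_gfp_t;    /* hypothetical stand-in for gfp_t */

/* Invented values for this model only. */
#define TOY_GFP_MAY_SLEEP 0x1u     /* context: the caller is able to block */
#define TOY_GFP_DMA32     0x2u     /* attribute: 32-bit reachable memory */

#define TOY_GFP_KERNEL    TOY_GFP_MAY_SLEEP
#define TOY_GFP_ATOMIC    0u

/* Records what the allocator was last told, so callers can inspect it. */
static toy_gfp_t last_alloc_flags;

/* The allocator itself: it knows only what the flags tell it. */
static void *toy_alloc(size_t size, toy_gfp_t flags)
{
    assert(size > 0);
    last_alloc_flags = flags;
    return malloc(size);
}

/*
 * Low-level helper: it knows the buffer must be DMA-reachable (an
 * attribute of this request), but not whether its caller may sleep.
 */
static void *toy_alloc_buffer(size_t len, toy_gfp_t mem_flags)
{
    return toy_alloc(len, mem_flags | TOY_GFP_DMA32);
}

/* Higher-level entry point: passes the caller's context down unchanged. */
static int toy_submit(size_t len, toy_gfp_t mem_flags)
{
    void *buf = toy_alloc_buffer(len, mem_flags);

    if (!buf)
        return -1;
    free(buf);
    return 0;
}
```

A caller in a sleepable context would use toy_submit(64, TOY_GFP_KERNEL), while one in atomic context would pass TOY_GFP_ATOMIC; forgetting to forward mem_flags at any level of the chain is exactly the kind of error this style of interface invites.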
GFP_NOFS
One flag in particular that describes the calling context is GFP_NOFS, which instructs the memory allocator to avoid calling into any filesystem code. In particular, that means that the allocator cannot start writeback on dirty pages to make more memory available. There are (at least) a couple of reasons to impose this constraint. One is that the allocation call itself is coming from filesystem code; in that case, calling back into the filesystem risks deadlocking the system. The other is that adding filesystem calls to a lengthy call chain could overflow the kernel stack, an outcome cherished by attackers but otherwise unloved by Linux users.
Given those possibilities, it is unsurprising that kernel developers have tended to take a "better safe than sorry" approach to the GFP_NOFS flag; as a result, that flag appears in a great many allocation calls — a quick grep shows over 1,300 instances in the 4.10-rc2 kernel. At the Linux Storage, Filesystem, and Memory-Management summit in April 2016, Michal Hocko called out use of this flag as a problem. It appears in many places where it is not really needed, unnecessarily constraining what the memory-management code can do and, as a result, worsening system performance. He suggested that this flag should be phased out in favor of a flag in the task structure that could be used to accurately track the allocation context.
More recently, he has proposed a new API that implements these changes. A new flag (PF_MEMALLOC_NOFS) is defined for the flags field of the task_struct structure. Then, whenever a thread enters a context where filesystem calls should not be made, it should call:
unsigned int memalloc_nofs_save(void);
This call will set the PF_MEMALLOC_NOFS flag and pass the previous flags value back as its return value. Exiting from the "no filesystem calls" context is done with a call to:
void memalloc_nofs_restore(unsigned int flags);
The flags value passed in should be the value returned from memalloc_nofs_save().
Between the two above calls, all memory-allocation requests executed in the current thread will behave as if the GFP_NOFS flag had been passed, regardless of whether it is actually present or not. Since each caller saves the previous context, these calls can be nested to any level and the right thing will happen. For now, the GFP_NOFS flag remains (there is the matter of those 1,300 users, after all), but one can see its eventual removal in the cards. The patch set begins that process by fixing some callers in the XFS and ext4 filesystem code. The resulting code should be clearer, and it eliminates the chance of a stray allocation calling into the filesystem code in the wrong place.
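The save/restore pattern, and the reason it nests properly, can be sketched in user space. This is a model under stated assumptions rather than the proposed patch: toy_task, toy_current, the TOY_* constants, and toy_effective_gfp() are hypothetical names, and the save function here returns only the previous state of the one flag bit, which is an assumption about how such a helper might be written.

```c
#include <assert.h>

/* Invented values for this model only. */
#define TOY_PF_MEMALLOC_NOFS 0x1u  /* stand-in for PF_MEMALLOC_NOFS */
#define TOY_GFP_FS           0x1u  /* allocator may call into filesystems */
#define TOY_GFP_IO           0x2u  /* allocator may start I/O */
#define TOY_GFP_KERNEL       (TOY_GFP_FS | TOY_GFP_IO)
#define TOY_GFP_NOFS         TOY_GFP_IO

/* Hypothetical stand-in for the current thread's task_struct. */
struct toy_task {
    unsigned int flags;
};

static struct toy_task toy_current;

/* Enter a "no filesystem calls" scope; returns the previous state. */
static unsigned int toy_memalloc_nofs_save(void)
{
    unsigned int old = toy_current.flags & TOY_PF_MEMALLOC_NOFS;

    toy_current.flags |= TOY_PF_MEMALLOC_NOFS;
    return old;
}

/* Leave the scope, putting back whatever state the matching save saw. */
static void toy_memalloc_nofs_restore(unsigned int flags)
{
    assert((flags & ~TOY_PF_MEMALLOC_NOFS) == 0);
    toy_current.flags =
        (toy_current.flags & ~TOY_PF_MEMALLOC_NOFS) | flags;
}

/* What the allocator would actually honor for a given request. */
static unsigned int toy_effective_gfp(unsigned int gfp)
{
    if (toy_current.flags & TOY_PF_MEMALLOC_NOFS)
        gfp &= ~TOY_GFP_FS;
    return gfp;
}
```

Because an inner toy_memalloc_nofs_save() call sees the flag already set and hands that state back, the matching inner restore leaves the flag in place; only the outermost restore clears it. A TOY_GFP_KERNEL request made anywhere inside the scope is treated as TOY_GFP_NOFS.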
Developers familiar with the memory-management code may think that this interface looks familiar. Indeed, it is inspired by a set of already existing functions:
unsigned int memalloc_noio_save(void);
void memalloc_noio_restore(unsigned int flags);
These functions were added to the 3.9 kernel in 2013 by Ming Lei; they move the GFP_NOIO flag (which inhibits the initiation of I/O operations) into the task structure in the same way.
The memory-allocation interface is, thus, clearly evolving in a direction where context-related information is stored with the rest of the thread's context rather than being passed through function arguments. This evolution can only be described as a slow process, though; there are nine memalloc_noio_save() calls in the 4.10-rc2 kernel, compared to nearly 500 uses of GFP_NOIO. Increasing the pace of change may be hard; switching to the new API requires a fairly deep understanding of the code involved and cannot be done with a simple script.
One could imagine taking this work further by, for example, tracking atomic context explicitly. But that is work for the future; completing the task for GFP_NOIO and GFP_NOFS should arguably be done first. Once all that is done, the kernel's memory-allocation API may better match the uses to which it is put. Given that we have only had 25 years to work on it so far, it is not entirely surprising that we have not gotten there yet.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
