Kernel development
Brief items
Kernel release status
The 4.1 merge window is still open; see the article below for a summary of what has been merged in the last week.
Stable updates: 3.19.5, 3.14.39, and 3.10.75 were released on April 20.
Kernel development news
4.1 Merge window, part 2
As of this writing, just over 9,800 non-merge changesets have been pulled into the mainline repository for the 4.1 development cycle; that's about 6,200 since last week's summary. Quite a few new features have been added as a result of all that merging; some of the most interesting, user-visible additions include:
- The simple persistent-memory driver
has been merged, improving the kernel's support for large,
non-volatile RAM devices.
- Support for file and directory encryption
in the ext4 filesystem has been pulled into the mainline.
- Multi-user operation is now optional: the
patch set removing support for non-root users has been merged.
This feature is mostly useful for those building dedicated kernels for
tiny embedded systems.
- The cls_bpf networking traffic-control classifier can now apply
extended BPF (eBPF) programs to packets. As shown in this
commit, this can allow the writing of arbitrary filter routines in
C that are then translated to eBPF for execution in the kernel.
The act_bpf module can run eBPF programs now as well. These
programs can, with either module, make changes to packets via the new
bpf_skb_store_bytes() function. The
eBPF engine has also gained the ability to access selected fields from
the socket buffer (SKB) data structure.
- Basic packet routing using the multiprotocol
label switching (MPLS) mechanism is now supported.
- The kernel has gained support for RFC 7217 IPv6
"semantically opaque interface identifiers."
- The maintainer of the Smack security module has reluctantly added a
"bringup mode" that can be used to debug security configurations.
"So, it's there, it's dangerous, but so many application developers seem incapable of living without it I have given in. I've tried to make it as safe as I can, but in the end it's still a chain saw."
- User-mode Linux has seen its support for multiprocessing and highmem
ripped out. Neither feature worked well (if at all) and both were
maintenance burdens.
- The kernel's "execution domain" support has been removed. The idea
behind this feature was to allow the provision of non-Linux
"personalities," but it has never seen much use or worked all that
well.
- The "zram" block device can now perform compression of block data.
(See this article for details on
zram).
- The MIPS architecture has gained support for "XPA" addressing,
allowing physical memory addresses up to 40 bits in length to be
accessed on 32-bit systems.
- The device mapper can now operate as a multiqueue block device,
increasing its scalability. This feature is currently disabled by
default, but can be turned on with the CONFIG_DM_MQ_DEFAULT
configuration variable.
- The "virtual GEM" graphics memory manager has been merged into the
direct-rendering subsystem. It provides memory management for a
virtual graphical device that can be useful for code doing
rendering in software.
- New hardware support includes:
- Processors and systems:
IMG Pistachio SoC-based boards,
MIPS common device memory map buses,
Marvell Armada 39x boards,
Annapurna Labs Alpine platforms, and
Xilinx ZynqMP SoCs.
- Audio:
Maxim max98925 codecs.
- Clock:
Marvell Armada 39x SoC clocks and
Qualcomm MSM8916 global clock controllers.
- Miscellaneous:
Broadcom iProc RNG200 random-number generators,
Broadcom BCM7038-style Level 1 interrupt controllers,
Imagination Technologies hardware hash accelerators,
STMicroelectronics ST33ZP24 TPM security chips,
Qualcomm PM8941 LED controllers,
Conexant Digicolor CX92755 realtime clocks,
MIPS EJTAG fast debug channel TTY ports,
Altera GPIO controllers,
Parade DisplayPort-to-LVDS bridges,
High-speed UARTs with DMA controllers,
Ingenic JZ4780 SoC NAND/external memory controllers,
Maxim MAX77843 micro USB interface controllers, and
VMware virtual mice.
- Networking:
NXP Semiconductors NCI near-field communications controllers.
- Video4Linux: LG Electronics LGDT3306A demodulators, Hauppauge HVR-955Q ATSC/QAM tuners, TechnoTrend TT-connect S2-4600 DVB-S/S2 tuners, Omnivision OV2659 sensors, and Xilinx video subsystems.
Changes visible to kernel developers include:
- The aio_read() and aio_write() methods have been
removed from the file_operations structure. The (relatively)
new read_iter() and write_iter() methods should be
used instead.
- As usual, see Daniel
Vetter's summary for a complete list of changes to the Intel i915
graphics driver.
- The HD-audio subsystem has been reorganized around a new "hdaudio" bus
that simplifies much of the device management and binding code.
- There is a new "log writes" target for the device mapper that logs all
write operations to a block device. It is meant for filesystem
debugging; see Documentation/device-mapper/log-writes.txt
for details.
- The new GPIO "hogging" mechanism can be used to easily (and permanently) wire the state of a specific GPIO line without the need for driver code; see this documentation patch for details.
At this point, most of the major trees have been pulled and the merge window is drawing toward a close. If the usual schedule is followed, that closing will happen (and 4.1-rc1 will be released) on April 26.
Taking control of SSDs with LightNVM
A great deal of work has gone into improving the Linux kernel's block layer so that it can keep up with solid-state storage devices (SSDs). Dealing with SSDs has often been an exercise in frustration, though, for one simple reason: the kernel is not able to manage the storage device directly, but, instead, must talk to a computer embedded in the device that is running some sort of flash translation layer (FTL) software. Developers have often felt that a better job could be done without the FTL getting in the way. Now, it seems, the hardware manufacturers are starting to make direct control easier; a patch set has been posted that aims to enable the kernel to take advantage of this opening.
The problems with flash translation layers are numerous and well known. Often they are designed to optimize access for one specific filesystem (FAT, for example), a feature that often makes performance worse for other filesystems. Attempts to allow operating systems to communicate usage information to the drives (the discard/TRIM command, which indicates a range of blocks that is not in use, is an example) have led to performance problems and bugs of their own. And an FTL baked into a drive cannot normally be upgraded or fixed when bugs appear. All of these problems would go away if the kernel could just access the low-level storage media directly.
There are a number of high-end nonvolatile memory (NVM) devices that provide this access now, but there's one catch: each model has its own interface, and sometimes the nature of those interfaces varies wildly. The rest of the kernel, though, cares little about those details; it needs to know a relatively small number of parameters to be able to manage such a device. The required information includes the layout of blocks on the device, some timing details, and not a whole lot more. If there were an abstraction layer in the kernel that provided just the required interface, the task of managing these devices would get easier.
A candidate for this abstraction layer is LightNVM, posted by Matias Bjørling. LightNVM is, at its base, a specification of an interface by which the kernel can access what Matias calls "open-channel SSDs." His implementation adapts the kernel's NVM Express driver to provide the LightNVM interface; the generic block layer code is then adjusted to take advantage of the new capabilities that are provided.
A LightNVM driver is, to begin with, an ordinary block driver. To get the full performance advantage, it should implement the multiqueue block interface, though that does not appear to be strictly necessary. On top of the block interface, though, the driver must implement a set of LightNVM-specific APIs, most of which are defined by this structure of function pointers:
    struct nvm_dev_ops {
        nvm_id_fn *identify;
        nvm_get_features_fn *get_features;
        nvm_set_rsp_fn *set_responsibility;
        nvm_get_l2p_tbl_fn *get_l2p_tbl;
        nvm_erase_blk_fn *erase_block;
    };
The identify() operation identifies the type of the device and, importantly, the number of independent I/O channels it supports; that number affects how many operations can execute in parallel. A call to get_features() obtains information about the capabilities of the drive, including whether it can do its own logical-to-physical address mapping, whether the drive performs garbage collection, whether it can perform ECC error correction, and so on. The set_responsibility() function tells the drive which features should actually be enabled. The current mapping between logical and physical blocks can be read from the device with get_l2p_tbl(). Finally, a call to erase_block() will cause a specific erase block to be wiped.
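The negotiation described above can be sketched in user space. This is only an illustration: the feature bits, the function signatures, and every name beginning with toy_ are assumptions made for the example (the article names the operations but not their arguments), and the mapping-table and erase operations are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; the posted patches define their own encoding. */
enum {
    TOY_FEAT_L2P = 1u << 0,   /* drive maps logical to physical addresses */
    TOY_FEAT_GC  = 1u << 1,   /* drive performs its own garbage collection */
    TOY_FEAT_ECC = 1u << 2,   /* drive performs ECC error correction */
};

struct toy_id {
    uint32_t type;
    uint32_t nr_channels;     /* independent I/O channels */
};

/* Hypothetical signatures for three of the five operations. */
struct toy_nvm_ops {
    int (*identify)(void *dev, struct toy_id *id);
    int (*get_features)(void *dev, uint32_t *features);
    int (*set_responsibility)(void *dev, uint32_t responsibilities);
};

/* A mock drive standing in for real hardware. */
struct toy_dev {
    uint32_t nr_channels;
    uint32_t features;        /* what the drive is able to do */
    uint32_t enabled;         /* what the host asked it to do */
};

static int toy_identify(void *dev, struct toy_id *id)
{
    const struct toy_dev *d = dev;

    id->type = 1;
    id->nr_channels = d->nr_channels;
    return 0;
}

static int toy_get_features(void *dev, uint32_t *features)
{
    const struct toy_dev *d = dev;

    *features = d->features;
    return 0;
}

static int toy_set_responsibility(void *dev, uint32_t responsibilities)
{
    struct toy_dev *d = dev;

    if (responsibilities & ~d->features)
        return -1;            /* cannot enable an unsupported feature */
    d->enabled = responsibilities;
    return 0;
}

const struct toy_nvm_ops toy_ops = {
    .identify = toy_identify,
    .get_features = toy_get_features,
    .set_responsibility = toy_set_responsibility,
};

/*
 * Host-side setup: learn the channel count, query the drive's features,
 * then enable only the wanted features that the drive actually offers.
 */
int toy_nvm_setup(const struct toy_nvm_ops *ops, void *dev, uint32_t wanted,
                  uint32_t *nr_channels, uint32_t *enabled)
{
    struct toy_id id;
    uint32_t features;

    assert(ops != NULL && dev != NULL);
    if (ops->identify(dev, &id) || ops->get_features(dev, &features))
        return -1;
    *nr_channels = id.nr_channels;
    *enabled = features & wanted;
    return ops->set_responsibility(dev, *enabled);
}
```

A host that asks for on-drive mapping and garbage collection from a mock drive offering only mapping and ECC would end up with mapping alone enabled.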
The code as posted appears to expect that on-drive logical-to-physical mapping will be supported, but that no other features will be present. Adding support for the other features should be an optimization opportunity in the future, especially as drives supporting options like "block move" (which relocates a block on the drive without requiring the host to read and rewrite it) become available.
At the block-layer level, the patch set provides a mechanism by which the LightNVM code can intercept I/O requests, remap them, and pass the modified request directly to the hardware. For a read request, this task is relatively simple: the logical-to-physical table is consulted to locate the block's address on the drive, then a read is performed from that address. Writes are more complicated, since data in flash cannot be rewritten in place. Instead, the code must find a new location for the block, cause it to be written there, invalidate the copy of the block at the old location, and update the translation maps accordingly.
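The read and write paths just described can be illustrated with a toy, in-memory translation layer. Everything here (the sizes, the toy_ftl structure, the append-only allocator) is an assumption chosen for the example rather than code from the patch set; the point is only the out-of-place write sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGES    8            /* physical pages on the toy device */
#define TOY_LBAS     4            /* logical blocks visible to the host */
#define TOY_PAGE_SZ  16
#define TOY_UNMAPPED UINT32_MAX

enum toy_page_state { TOY_PAGE_FREE, TOY_PAGE_VALID, TOY_PAGE_INVALID };

struct toy_ftl {
    uint32_t l2p[TOY_LBAS];                   /* logical-to-physical table */
    enum toy_page_state state[TOY_PAGES];
    uint8_t flash[TOY_PAGES][TOY_PAGE_SZ];
    uint32_t next_free;                       /* simple append point */
};

void toy_ftl_init(struct toy_ftl *f)
{
    memset(f, 0, sizeof(*f));                 /* TOY_PAGE_FREE is zero */
    for (uint32_t i = 0; i < TOY_LBAS; i++)
        f->l2p[i] = TOY_UNMAPPED;
}

/* Read: consult the table, then read from the physical location. */
int toy_ftl_read(const struct toy_ftl *f, uint32_t lba, uint8_t *buf)
{
    if (lba >= TOY_LBAS || f->l2p[lba] == TOY_UNMAPPED)
        return -1;
    memcpy(buf, f->flash[f->l2p[lba]], TOY_PAGE_SZ);
    return 0;
}

/*
 * Write: flash cannot be rewritten in place, so pick a new page, write
 * the data there, invalidate the old copy, and update the table.
 */
int toy_ftl_write(struct toy_ftl *f, uint32_t lba, const uint8_t *buf)
{
    uint32_t new_page, old_page;

    if (lba >= TOY_LBAS || f->next_free >= TOY_PAGES)
        return -1;            /* full; a real FTL would reclaim space here */

    new_page = f->next_free++;
    assert(f->state[new_page] == TOY_PAGE_FREE);
    memcpy(f->flash[new_page], buf, TOY_PAGE_SZ);
    f->state[new_page] = TOY_PAGE_VALID;

    old_page = f->l2p[lba];
    if (old_page != TOY_UNMAPPED)
        f->state[old_page] = TOY_PAGE_INVALID;
    f->l2p[lba] = new_page;
    return 0;
}
```

Writing the same logical block twice leaves the first physical page marked invalid and the table pointing at the second; reclaiming such invalid pages is the garbage collector's job.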
In the posted patch set, this work is done by the "RRPC target," a round-robin FTL built into the kernel itself. The wear-leveling algorithm used is fairly simple; the code sequences through erase blocks one after another, relocates any valid sectors found within each erase block, then wipes the block for reuse. It does not even support discard requests at this point. The point is clearly to demonstrate a functioning in-kernel FTL while leaving the optimization opportunities for later.
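A round-robin reclaim pass of the kind described here might look roughly like the following user-space sketch. It is not the RRPC code: the rr_ names, the tiny geometry, and the first-free destination search are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define RR_BLOCKS 4               /* erase blocks on the toy device */
#define RR_PAGES  4               /* pages per erase block */
#define RR_LBAS   8
#define RR_NONE   UINT32_MAX

enum rr_state { RR_FREE, RR_VALID, RR_INVALID };

struct rr_page {
    enum rr_state state;
    uint32_t lba;                 /* owner of the data when state is RR_VALID */
};

struct rr_dev {
    struct rr_page page[RR_BLOCKS][RR_PAGES];
    uint32_t l2p[RR_LBAS];        /* lba -> block * RR_PAGES + page */
    uint32_t gc_cursor;           /* next erase block to reclaim */
};

void rr_init(struct rr_dev *d)
{
    for (uint32_t b = 0; b < RR_BLOCKS; b++)
        for (uint32_t p = 0; p < RR_PAGES; p++)
            d->page[b][p] = (struct rr_page){ RR_FREE, RR_NONE };
    for (uint32_t i = 0; i < RR_LBAS; i++)
        d->l2p[i] = RR_NONE;
    d->gc_cursor = 0;
}

/* Find a free page outside the victim block, or RR_NONE if there is none. */
static uint32_t rr_find_free(const struct rr_dev *d, uint32_t victim)
{
    for (uint32_t b = 0; b < RR_BLOCKS; b++) {
        if (b == victim)
            continue;
        for (uint32_t p = 0; p < RR_PAGES; p++)
            if (d->page[b][p].state == RR_FREE)
                return b * RR_PAGES + p;
    }
    return RR_NONE;
}

/*
 * Reclaim the block under the cursor: relocate its valid pages, wipe it,
 * and advance the cursor. Returns the block erased, or -1 if there was
 * no room to relocate (the block is then left unerased).
 */
int rr_gc_step(struct rr_dev *d)
{
    uint32_t victim = d->gc_cursor;

    for (uint32_t p = 0; p < RR_PAGES; p++) {
        struct rr_page *src = &d->page[victim][p];
        uint32_t dst;

        if (src->state != RR_VALID)
            continue;
        dst = rr_find_free(d, victim);
        if (dst == RR_NONE)
            return -1;
        assert(src->lba < RR_LBAS);
        d->page[dst / RR_PAGES][dst % RR_PAGES] =
            (struct rr_page){ RR_VALID, src->lba };
        d->l2p[src->lba] = dst;
        src->state = RR_INVALID;
    }

    for (uint32_t p = 0; p < RR_PAGES; p++)
        d->page[victim][p] = (struct rr_page){ RR_FREE, RR_NONE };
    d->gc_cursor = (victim + 1) % RR_BLOCKS;
    return (int)victim;
}
```

Because the cursor simply walks the blocks in order, every block is erased about equally often, which is the whole of the wear-leveling policy in this sketch.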
There will be a number of such opportunities, but it could take a while to realize many of them. For example, getting the best performance out of such a device requires spreading data across each of the available channel controllers in such a way as to keep them all busy. To an extent, that could be done purely in the FTL, but chances are good that higher performance will result if the filesystem is aware of the device's geometry. Currently there is no API to pass that information up, so, needless to say, no filesystems have that support.
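As a rough illustration of what "keeping all of the channels busy" could mean inside an FTL, the simplest policy is to stripe consecutive logical blocks across the channels. The function and structure names below are hypothetical, and nothing in the article says that the posted code places data this way.

```c
#include <assert.h>
#include <stdint.h>

struct chan_addr {
    uint32_t channel;   /* which channel controller services the block */
    uint64_t offset;    /* block index within that channel */
};

/* Naive striping: consecutive logical blocks land on consecutive channels. */
struct chan_addr stripe_lba(uint64_t lba, uint32_t nr_channels)
{
    struct chan_addr a;

    assert(nr_channels > 0);
    a.channel = (uint32_t)(lba % nr_channels);
    a.offset = lba / nr_channels;
    return a;
}
```

With four channels, a sequential run of logical blocks 4 through 7 touches each channel exactly once. A geometry-aware filesystem could presumably do better by placing related data deliberately, which is why the missing API matters.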
So LightNVM in its current form is just a start. But it should be enough to test the idea that kernel developers can, in the long run, do a better job of managing flash arrays than the firmware developers who write FTLs have traditionally been able to achieve.
There is one last question that has not really even been asked yet with regard to this patch set, though. LightNVM is intended to manage nonvolatile memory as if it were a block storage device. But there is a lot of work going into the creation of large, nonvolatile-memory devices that are mapped directly into the system's physical address space; from there, that memory can be mapped into a process's virtual address space. While it would be possible to use a kernel layer like LightNVM to do the low-level management for directly mapped devices, that does not appear to be the approach that most manufacturers (and developers) have in mind. So it seems likely that the FTL will remain deeply buried within the hardware for those devices. That could, in the long term, restrict the applicability of block-oriented subsystems like LightNVM.
The kdbuswreck
Few readers will have failed to notice by now that the attempted merging of the kdbus interprocess communication system into the 4.1 kernel has failed to go as well as its proponents would have liked. As of this writing, the discussion continues and nothing has been merged. This article constitutes an attempt to derive a bit of light from the massive amounts of heat that the discussion has generated.
Some observers have portrayed the opposition to kdbus as a front in the systemd wars, the intent being to obstruct its merging and set back the perceived systemd agenda. There have been a few messages mentioning systemd and expressing a lack of trust in its developers, but that has been the smallest part of the conversation; it can be safely disregarded. That is not where the serious objections come from.
As was mentioned last week, there is a certain level of discomfort with the core aspect of the design of kdbus: that it implements the D-Bus protocol. Some developers would rather not see kdbus in the kernel at all; others wish that it were an add-on to a more generic messaging solution. With regard to the D-Bus design, this message from Havoc Pennington, one of the original designers of D-Bus, is worth a read. In short: he acknowledges that D-Bus is not perfect, but asserts that it does incorporate a lot of lessons from previous attempts and, as a result, it has been successful.
The most specific advocate of a more general messaging solution is arguably Alan Cox. His latest suggestion would appear to be to go back to the old AF_BUS approach; this patch implemented something D-Bus-like over sockets, but was rejected by the networking maintainers. Alan thinks it's worth another try, given that the kernel already has almost everything that is needed. There have been few signs, though, that the kdbus developers are in the mood to drop their work and attempt to resurrect an approach that has already failed once to get into the kernel.
Metadata and capabilities
The fiercest bone of contention, though, would appear to be a topic that has come up before: the passing of process-specific metadata with messages. In particular, developers led by Andy Lutomirski have continued to assert that kdbus should not attach information about a sending process's capabilities and command line to messages as they cross the bus.
The purpose of the transmission of capabilities, in particular, is to
enable privileged processes on the bus to carry out actions at the request
of another process on the bus — if that other process has the requisite
capabilities. The plans for systemd involve allowing processes to request
actions like changing the system time, tweaking the network configuration,
or rebooting the system over the bus; the requested action will only be
carried out if the requester has CAP_SYS_TIME,
CAP_NET_ADMIN, or CAP_SYS_BOOT, respectively.
The kdbus developers point out that one process can learn about another process's capabilities now by reading files in /proc. There's a little problem, though: reading from /proc is subject to race conditions. A process could request a privileged action over D-Bus, then quickly use exec() to run a setuid binary. If the exec() happens before the receiving process gets around to reading /proc, that process will see the new binary's elevated privileges and allow something that the original caller should not have been able to do. So capability-based authentication is not much used in current systems. One of the many appealing features of kdbus is that it makes such capability checks safe; the kernel can guarantee that the capabilities it transmits with the message are what the sending process held when the message was sent.
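For illustration, the /proc-based check might look like the sketch below. The helper names are made up for the example; the capability numbers are copied from <linux/capability.h> so that the block stands alone, and the "CapEff:" line is the effective-capability mask reported in /proc/<pid>/status. The last function is racy for exactly the reason given above: nothing ties the moment of the read to the moment the request was sent.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values from <linux/capability.h>, repeated to keep the sketch standalone. */
#define TOY_CAP_NET_ADMIN 12
#define TOY_CAP_SYS_BOOT  22
#define TOY_CAP_SYS_TIME  25

/* Pull the effective capability mask out of /proc/<pid>/status text. */
bool parse_cap_eff(const char *status_text, uint64_t *mask)
{
    const char *p = status_text;

    while ((p = strstr(p, "CapEff:")) != NULL) {
        /* Only accept a match at the start of a line. */
        if (p == status_text || p[-1] == '\n')
            return sscanf(p, "CapEff: %" SCNx64, mask) == 1;
        p++;
    }
    return false;
}

bool cap_is_set(uint64_t mask, unsigned int cap)
{
    assert(cap < 64);
    return (mask >> cap) & 1;
}

/*
 * Racy by construction: the target may exec() a setuid binary between
 * sending its request and this read, so the mask seen here need not be
 * the one the requester held when it asked.
 */
bool racy_peer_has_cap(long pid, unsigned int cap)
{
    char path[64], buf[4096];
    uint64_t mask;
    size_t n;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    fp = fopen(path, "r");
    if (!fp)
        return false;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return parse_cap_eff(buf, &mask) && cap_is_set(mask, cap);
}
```

A service deciding whether to honor a time-change request would call something like racy_peer_has_cap(sender_pid, TOY_CAP_SYS_TIME); having the kernel attach the mask at send time, as kdbus does, removes the window between the request and the check.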
Andy (and others) have a number of objections to this approach, starting
with the assertion that capabilities are meant to be
interpreted by the kernel,
not by user space. By adding these features, user-space developers are
said to be violating the layering of the system while
broadening the meaning of the relevant capabilities — and they are
generally seen as being overly broad already. As an example,
CAP_SYS_BOOT gives the ability to call the reboot()
system call and immediately reboot the system. Systemd will respond to a
reboot request (from a process with CAP_SYS_BOOT) over
D-Bus, however, by initiating a clean reboot, unmounting
filesystems, shutting down services, etc. Those are actions that
CAP_SYS_BOOT would not enable on its own. Eric Biederman was
quick to suggest that this extension of the
CAP_SYS_BOOT capability could be helpful to an attacker.
Andy also pointed out that the set of capabilities is determined by the kernel source. They can never be extended, so they will limit the expressiveness of authentication mechanisms using kdbus. It would be better, he said, to have a separate, capability-like mechanism implemented in user space that could be extended as the need for new privileges is encountered.
Then there is an interesting little problem in the intersection of
capabilities and user namespaces. If a
process connects to D-Bus, then moves into its own user namespace, it will
appear to have all available capabilities. That would allow the capability
checks to be bypassed entirely. This particular problem was fixed in kdbus
some time ago by simply dropping the capability metadata when a message
crosses a user-namespace boundary. But that fix comes at a cost: now the
capability checks do not work at all for processes in user namespaces. The
capability-based authentication mechanism, in other words, falls apart on a
system where user namespaces are being used for containerization. Systemd
maintainer Lennart Poettering doesn't see this
limitation as a problem because user namespaces are not (yet) heavily
used, but others may well disagree with this assessment.
Eric pointed out that there is a capability translation mechanism that could be used to properly transmit capabilities across namespace boundaries. But he also complains that passing capabilities leaks information about sending processes and is thus a security problem in its own right. Linus was not especially sympathetic to that concern, but others, Andy and Alan included, feel that a process should explicitly indicate that it intends to perform an action requiring a specific capability before any such information is sent.
Finally, though it hasn't been said explicitly, there is the simple fact that most kernel developers see capabilities as a failed experiment. There is no shortage of developers who would like to see them removed from the kernel altogether. That cannot be done — too many tiresome problems with applications breaking and such — but this feeling does lead to resistance to code that seems to expand the role of capabilities further.
Lennart, though, maintains (in the message linked above) that capabilities
do have their value and that capability checks are better than an
all-or-nothing check for root privileges. He is not thrilled with the
suggestion that kdbus should implement a new
user-space privilege mechanism,
saying that "we are not really in the business in designing
comprehensive new access control systems that can be used for in-kernel and
in-userspace subsystems." There seems to be little inclination to
consider alternatives (especially those that do not actually exist) at this
point.
And that seems to be the core of the impasse. Andy believes that this use of capabilities is dangerous, extending their meaning and bringing in a bunch of security-related code for little real benefit. The kdbus designers, instead, see metadata attachment as a useful tool for the implementation of sandboxing and privilege-separation schemes, and they are unwilling to drop it. Both positions seem firmly entrenched at this point, so it may well come down to what Linus decides to do. He has, for the most part, stayed out of the discussion, but in one message he indicated that most of the capability-related worries don't concern him that much. So he may yet pull kdbus into the kernel, though it would not be entirely surprising if it had to wait one more development cycle first.
Page editor: Jonathan Corbet
