Kernel development
Brief items
Kernel release status
The current development kernel is 4.9-rc1, released on October 15, one day earlier than some might have expected. Linus said: "My own favorite 'small detail under the hood' happens to be Andy Lutomirski's new virtually mapped kernel stack allocations. They make it easier to find and recover from stack overflows, but the effort also cleaned up some code, and added a kernel stack mapping cache to avoid any performance downsides." The virtually mapped kernel stack work was covered here in June.
Stable updates: 4.8.2, 4.7.8, and 4.4.25 were released on October 16. The relatively small 4.8.3, 4.7.9, and 4.4.26 updates are in the review process as of this writing; they can be expected on or after October 21.
Quotes of the week
 __________________________________
< XFS has gained super CoW powers! >
 ----------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
Kernel development news
The end of the 4.9 merge window
By the time that Linus released 4.9-rc1 and closed the merge window for the 4.9 development cycle, 14,308 non-merge changesets had found their way into the mainline repository. As expected, this cycle has already broken the previous record for the busiest cycle ever, and it still has a while to go. 820 of those changesets were merged after last week's summary was written. Some of the more interesting changes found in this last set include:
- The XFS filesystem has gained support for shared extents — ranges of
file data that can be shared between multiple owners — and a
copy-on-write mechanism to manage modifications to those extents.
That, in turn, allows XFS to support copy_file_range() along with
other nice features like data deduplication.
- The NFS server now supports the NFS4.2 COPY operation, allowing file
data to be copied without traveling to the client and back.
- The watchdog subsystem has a new "pretimeout" mechanism to allow the
system to respond just prior to the expiration of a timer. Two new
"governors" are provided; one simply prints a log message, while the
other will panic the system in the hope of generating more useful
information for debugging the problem.
- A set of EXPORT_SYMBOL()
improvements has been merged. It is now possible to place export
directives into assembly code, and the handling of exported symbols in
library objects has been improved. One immediate practical result is
that it is now possible to place all EXPORT_SYMBOL()
directives next to the definition of the symbol that is being
exported. At the moment, checksums (for use with
CONFIG_MODVERSIONS) for assembly symbols are not generated;
that should be fixed in the near future.
- The build system can now use "thin archives" for the creation of
intermediate objects, rather than linking them with
ld -r. A thin archive contains symbol information, but
simply points to the component object files rather than making
copies. The main purpose here seems to be to make the PowerPC build
work more smoothly; see this
commit for some more information.
- The build system can also perform dead code and data elimination.
This option is potentially hazardous, since, without some extra
effort, the linker may see some needed code as being dead, but it can
also reduce the resulting image size considerably.
- There is a new GCC plugin called "latent_entropy", which comes
from the grsecurity/PaX patch set. It will instrument
the kernel in an attempt to collect randomness, especially during the
early bootstrap process.
- New hardware support includes: Loongson 1C processors, Freescale "data path acceleration architecture" hardware buffer and queue-management subsystems, and Imagination Technologies ASCII LCD displays.
At this point the feature work is done; all that remains is to stabilize all that new code for the final 4.9 release. If all goes according to the usual schedule, that release can be expected on December 4 or 11.
Rethinking device memory allocation
James Jones started his 2016 X.Org Developers Conference (XDC) talk by saying that he would like to make some real progress at the conference on creating a user-space API for allocating memory that is also accessible by various devices. His talk on day one of the conference set the stage for a meeting of interested developers on day two. By day three, he reported back in a lightning talk on the progress made.
Jones has worked at NVIDIA on window system integration over the last decade or so, which originally meant X11, but now also includes other window systems. There are some existing solutions for memory allocation, but NVIDIA noticed some drawbacks to them when it tried to make them work with its drivers. So the company proposed EGLStream as a solution, which was "not so well-received so far", but it did help identify the problems that need to be solved.
That proposed patch added EGLStream to the Weston compositor, but it launched a discussion of Generic Buffer Management (GBM), which Weston already uses for memory allocation, versus EGLStream. Many strong views were expressed in that discussion; there has already been considerable investment in the existing APIs, both by Mesa and Wayland developers as well as by NVIDIA, so it is not surprising that there were differences of opinion. But it was nice to have a civil discussion about the memory allocation issue, he said, and many areas for improvement were identified. The discussion has died down and it was suggested that XDC would be a good venue to make some progress on the issue.
The problem is how to allocate device-accessible surfaces (memory buffers for various kinds of graphics and video data) from user space. The devices are things like GPUs, scanout engines, and video encoders and decoders. The surfaces allocated are for textures, images, and such; there is a need for some kind of handle for the surfaces that can be securely passed between user-space processes. In addition, a way to manage the surface state (e.g. format, color parameters, compression) and its layout in memory needs to be part of the API. In order to use these buffers in different parts of the system, some kind of synchronization mechanism is required. The latter is not directly related to the allocation problem, but is something that needs to be kept in mind, he said.
His goal is to get a consensus-based forward-looking API for surface allocation, but he has "no idea" what that API will be, at least yet. It should be agnostic with regard to window systems, kernels, and graphics vendors. So it will be able to be used for window systems like Wayland and others, by old and new Linux kernels, and by other kernels beyond Linux, as long as they are POSIX-like. It would have a "minimal but optimal driver interface" that would still be able to use "100% of the GPU's capabilities". While not directly related to surface allocation, the "final destination", he said, is to have "a completely optimized scene graph" for Weston and other scene-graph compositors.
Prior art
Jones then went into a review of the existing solutions to this problem, with their pros and cons—starting with GBM. At the basic level, GBM has the ability to allocate surfaces and to arbitrate the uses of a surface with a set of flags. It also provides handles to those surfaces. It is incorporated into many code bases at this point, so it is widely deployed and well tested. It has a pretty minimal API and fairly small implementation.
But GBM does have some shortcomings. The handles are process-local; there are ways to import handles from elsewhere, but not to export them to other processes using the API. It is focused on GPU operations (texturing, rendering, and display), so there is no way to specify that a surface would be used for rendering and passed to a video encoder, for example. Related to that is that the arbitration for the capabilities needed by a surface is done only in the scope of a single device, so you can't use the API to specify surfaces that will be used with multiple devices.
The Chrome OS Freon project attempted to add surface state management capabilities on top of GBM. There was a lot of discussion between vendors, but no consensus was reached on an optimal design, so something "not ideal" was settled on. The main point of contention was the level of abstraction in describing the transitions between various uses of a surface.
Android's Gralloc has a similar feature set to GBM. It has support for synchronization using fence file descriptors, but passing handles between processes requires other components from an Android system as there is no direct support for it in Gralloc. It has been widely deployed and is proven in the field. It also has an allocation-time usage specification that has support for non-graphics usage (such as video encoders and decoders).
Many of the shortcomings of Gralloc are similar to those of GBM as well. There is no explicit surface state management and the arbitration abilities are flag-based. It is open source, but the API is proprietary in some sense, since Google controls it.
EGLStream was developed to solve the problems he described, so it is not surprising that it provides allocation, arbitration, handles that can be shared by different processes, state management, and synchronization. NVIDIA has been shipping EGLStream for quite some time for a lot of different use cases, he said. It has been ported to all of the different operating systems that the company supports and has a comprehensive feature set.
While EGLStream is an open standard, in practice there is only a single vendor that has implemented it. It does not have cross-device support and it is EGL-based, which may complicate things by bringing OpenGL into the picture. It has been said that EGLStream does too much encapsulation and tries to do too much extra within the API. In addition, its behavior is loosely defined, or even undefined, in some cases.
The DMA-BUF allocation mechanism provides handles to memory allocations that can be shared between drivers; it supports non-graphics devices as well. But it does not have a centralized user-space allocation API, is Linux-only, and lacks any way to describe the content layout. It also only has a limited means to describe the planned usage of the memory at allocation time.
The Vulkan 3D graphics and compute API is one other thing to consider, Jones said. It provides an allocation mechanism as well as the most detailed allocation-time usage specification that he knows of. It has explicit state management and has a robust synchronization mechanism as well. Vulkan is both extensible and portable, but there is no support at this point for cross-process handles or arbitration. It is also focused only on graphics, compute, and display operations.
Path forward
Based on the prior art and the needs going forward, a set of features needed was identified and generally agreed upon. Whatever the new API is, it should be minimal—anything that is not needed should be eliminated. It should also be portable to multiple platforms and have support for non-graphics devices (e.g. rendering to a video encoder or texturing from a video decoder). It should also use the GPU optimally in the steady state when someone is not moving windows around on the screen; X11 already has this, so anything new should be at least as good.
To achieve that, he believes there is a need for something like what Vulkan has in terms of an allocation-time usage specification. So when the driver is asked for an allocation, all of the different use cases for the surface can be specified. That will allow the driver to negotiate the surface capabilities based on those use cases. During transitions (such as moving a window or going from a window to full screen), the performance still needs to be good. The idea is to allow multiple uses of the surface without having to do reallocations.
So, there are various existing APIs and a set of more-or-less consensus goals; what is the path forward? He suggested focusing on solving specific problems that occur with the existing APIs, rather than trying to pick a winner from those APIs. By solving the problems, it will become clear what the API should look like—what it is called at that point is not particularly important.
Specifically, he suggested that the focus should be on how to create a surface that is cross-driver, cross-engine, and cross-device. Historically, that has been where everything falls apart. If agreement can be reached on that, other simpler cases will just fall out naturally.
He presented a set of assumptions that he hoped would help simplify the initial discussions. To start with, those working on this problem should assume they are designing an ideal allocation API. That may not actually be the case, but it is a good way to think about it. Thinking in terms of the user-space API first, while keeping both API elegance and the capabilities of the hardware in mind, is also important.
There needs to be a standard way to describe the capabilities of different devices (for example, devices have different tiling formats, but other drivers won't know anything about some of those formats). It could be similar to the Khronos data format specification but cover other types of capabilities beyond pixel data formats.
Capabilities could then be queried from each driver, though the list could become quite large, so some filtering mechanism would be needed. There would also need to be a central authority of some sort to maintain the capability namespace. That could simply be a file in a Git repository or, perhaps, a group like Khronos—it simply needs to be authoritative. The surface allocation layer would collect up and intersect the capabilities of all of the different drivers.
There is a question of how to filter these capabilities. The API could provide a way to describe the desired usage of the surface, including things like its format, dimensions, and the operations that will be performed using it. The Khronos data format could again be used as a model for how to describe this information. Some types of data have obvious representations (e.g. width/height) and others can be indicated using Boolean flags like those in Gralloc. But there would also be capabilities that are driver-specific, so drivers would have to ignore ones that are targeted at other devices.
Once the capabilities that are not supported by all of the involved drivers have been eliminated, there needs to be a way to choose the optimal remaining choice. Sorting the remainder depends on the implementation and usage, so it cannot be done by the common framework. His straw-man proposal was to let the application decide once the list has been narrowed down.
After the surface has been allocated, its chosen properties must also be described. That could perhaps use the same data format as the capability information, but it must be communicated to the requester in some fashion.
He finished the presentation by noting that all of what had been discussed thus far concerned the image-level capabilities for the requested memory. But there are also some memory-level capabilities that may come into play, notably whether the memory must be physically contiguous. He thought that the image capability concept could be generalized to cover the memory-level requirements as well. Extensibility to allow for tiling layouts or hardware compression of surfaces, for example, would also be important.
Results
In Jones's lightning report of the meetings held on day two, he indicated that some good progress had been made; agreement had been reached on some key points. An allocation request will contain some basic properties like width, height, and format (others will be available via an extension mechanism) along with a list of usage descriptions (e.g. render target, video encoder input).
The arbitration of the properties is based on intersected sets of supported capabilities along with sets of constraints that get combined together (e.g. a certain stride might constrain the alignment differently between devices). The exact merging of the constraints may not exactly be the union of them, but the merging algorithm will be baked into the library, he said. There will be a set of common capabilities, but some can be vendor-specific; constraint definitions will be shared.
The capability sets will be reported back to the application, which can serialize them to pass to other processes to allow for incremental refinement. Processes could ask that the list be filtered for specific uses to help winnow down the choices. Once that is done, the sorting is handled by the drivers and the allocation takes place once a single capability set has been chosen. This API will be exposed via a library that has user-space driver/vendor back-ends.
There are still plenty of things to be resolved, particularly how sorting the capabilities is actually done. There was a lot of discussion of how that might be handled, but no conclusion was reached. In addition, the application may need to be able to tell the hardware when the surface is only being used as one of the use cases and when it transitions to one of the others, but how to do that has not been determined.
How to specify format types is another unresolved piece and they did not discuss the type of handle that would be used for an allocated surface. There is a question whether devices will be enumerable using the API. Also, which kernel interface would be used for allocation has not been resolved. Essentially, Jones said, it has reached a point where folks need to go off and start doing some research and trying things out before further progress can be made.
For more information, Jones's PDF slides from the talk are available, as is YouTube video of his talk and lightning talk report. His notes from the meetings are also available. He posted an update and pointer to his GitHub repository on the dri-devel mailing list on October 4.
[I would like to thank the X.Org Foundation for sponsoring my travel to Helsinki for XDC.]
Linux drivers in user space — a survey
Writing device drivers in user space, rather than as kernel modules, is a topic that comes up from time to time for a variety of reasons. The kernel's approach to user-space drivers varies considerably depending on the type of device involved. The recent posting of a patch set aimed at allowing LED drivers to be written as user-space programs seems like a suitable opportunity to have a look at the range of options currently available.
For it to be possible to write a device driver in user space it is necessary for the kernel to export the required interfaces. The kernel can export two different sorts of interfaces, meeting different needs; I will call them "upstream" and "downstream" interfaces.
When one reflects on the tree-like nature of the driver model, as described in an earlier article, it is clear that there is a chain, or path, of drivers from the root out to the leaves, each making use of services provided by the driver closer to the root (or "upstream") and providing services to the driver closer to the leaf (or "downstream"). An upstream interface allows a user-space program to directly access services provided by the kernel that normally are only accessed by other kernel drivers. A downstream interface allows a user-space program to instantiate a new device for some specific kernel driver, and then provide services to it that would normally be provided by some other kernel driver.
Upstream interfaces
An upstream interface is one that provides access to some hardware, possibly more directly than with the standard interfaces. In several cases this is provided not with a new interface but with a slight modification to an existing interface. Opening a block device with the O_DIRECT flag allows directly reading from and writing to that device without involving the page cache or the readahead and write-behind that it supports. Similarly, direct access to a serial port is obtained by opening a TTY device and disabling certain termios settings such as ECHO and ICANON. The documentation for cfmakeraw() identifies 16 such flags that are cleared.
Direct access to a network device can be achieved by creating a network socket using the AF_PACKET address family and specifying the SOCK_RAW communication type. This socket can then be bound to a particular interface or a particular Ethernet protocol type. A slightly less direct interface can be had by using SOCK_RAW with AF_INET. This still provides the routing and other functionality common to all IP protocols, but gives complete control over the payload of each IP packet.
Moving on to more purpose-built interfaces, the sg and bsg drivers (SCSI generic and block SCSI generic) both provide direct access to SCSI devices, or other devices such as SATA that use a compatible protocol. They allow SCSI command descriptor blocks (CDBs) to be sent to devices and to have results returned. The bsg interface is integrated with the block layer and supports a newer version of the sg interface that includes support for bidirectional commands. libsgutils is the recommended mechanism for making use of these interfaces, rather than working directly with /dev/sgN. Similarly, libusb provides a direct interface to USB devices, allowing arbitrary USB commands to be sent to any connected USB device.
I2C and SPI — 2-wire and 4-wire buses for communicating between integrated circuits on the same board — can be directly accessed via special-purpose character devices. For I2C, the i2c-tools package provides a scriptable interface. For SPI there do not appear to be any packaged solutions, though the armbedded.eu web site provides some code that would be worth trying for anyone who is interested.
All the interfaces listed so far are always available, to sufficiently privileged processes, if the kernel knows about the target device at all. Other interfaces require the kernel to be explicitly instructed to export a low-level interface. In the case of GPIOs (general-purpose I/O pins) and power regulators, this is as simple as adding some directives to the device-tree description of the hardware. The devices then appear in sysfs complete with attribute files allowing relevant settings to be changed and values to be read.
Finally, and requiring even more in-kernel support, is the UIO framework, which is intended for devices that are accessed through memory-mapped device registers, as is the norm for devices attached to PCI and similar buses. A simple in-kernel device driver can be written using the UIO framework that allows a user-space program to map that register bank into its own memory, and also to respond to interrupts from the device. This does not provide generic access to any PCI device, but does make it easy to get user-space access to a particular device of interest, so that the bulk of the driver can be developed, debugged, and maintained outside of the kernel.
This variety of different interface styles could be seen as a hodge-podge that is just crying out to be unified. On the other hand, different sorts of devices really are different and need different sorts of interfaces. Part of the role of an operating system like Linux is to hide as much of that difference as possible behind uniform abstractions. It should not be surprising that, if we want to bypass those abstractions and access the devices directly, we will be confronted by the variety that Linux generally tries to hide.
Downstream interfaces
Where upstream interfaces provide direct access to hardware, downstream interfaces allow a program to emulate some hardware and so provide access to other programs that expect to use a particular sort of interface. Rather than just providing a different sort of access to an already existing device, a downstream interface must make it possible to create a new device, configure it, and then provide whatever functionality is expected of that device type.
Probably the first driving force for these downstream interfaces was the introduction of networking and the consequent desire to allow a program on one computer to work with a device on another computer. With this came pseudo TTYs (PTYs), which are likely the oldest downstream interface in Unix. They allow a TTY to be created on which a user can log in and run programs that don't need to be aware that they are not attached to a physical terminal. The text entered can easily come from anywhere on the network, and the output generated can go back to the same place (or elsewhere).
The desire for network access to storage brought about such things as nbd, the network block device, and NFS, the network file system. Their design differs from that of PTYs in that they don't just provide an interface to user space that a network service could use but, instead, create the network connection themselves and define a protocol to carry the data and control over that connection. The most likely reason for this is that managing a storage service in a user-space program is prone to deadlocks. If the program ever needs to allocate memory, the kernel might choose to free up memory by writing out to a storage device, and if that device is managed by the program allocating memory it could easily deadlock. It is much safer to bypass user space and send directly to the network.
These network protocols can still serve as downstream interfaces in that they make it possible to instantiate a block device (with nbd) or a filesystem (with NFS) and provide services to it. This has been used to good effect with automounting programs such as amd (subsequently renamed to am-utils) that present as an NFS filesystem that contains only directories and symlinks (thus avoiding any deadlock issues) and transparently mounts filesystems when they are first accessed.
Though using NFS for this purpose is quite effective, it is not perfect; due to the limited possible interactions with the Linux virtual filesystem layer, filesystems must be mounted somewhere else and the NFS filesystem only contains a symbolic link to the real mount point. To address this shortcoming, Linux provides a dedicated downstream interface for creating filesystems, autofs, which supports the extra interactions required to automount filesystems directly onto directories.
Similarly there is a downstream interface for writing filesystems that is careful about how it interfaces with the page cache, and manages to avoid the writeback deadlocks described above: FUSE.
As part of FUSE there is CUSE, which allows character devices to be implemented in user space. There does not appear to be a corresponding "BUSE" for implementing block devices in user space, though some years ago there was a proposal for "ABUSE" which aimed to do just that. Block devices can be implemented in user space on a remote machine using nbd and presumably that is sufficient to meet most needs.
Networking plays a role in the next pair of examples too; the TUN and TAP drivers allow network devices to be emulated. TAP sends and receives Ethernet frames, so any networking protocol can be used with a TAP device. TUN works at the IP level, which is simpler and often sufficient provided there is no need to handle non-IP protocols such as ARP. These can most obviously be used for tunneling and creating virtual private networks (VPNs) but could also be used for user-space monitoring and filtering of network traffic.
Network devices, block devices and character devices (which include TTYs) cover all the device types that Unix supported before Linux came along. Linux has added a variety of new device types, some of which can be implemented in user space.
The input subsystem provides a standard interface for input devices such as keyboards, mice, joysticks, touch pads, and similar devices. These are exposed to user space as character devices, so it might be possible to emulate them using CUSE, but it is more convenient if they are integrated with the rest of the input subsystem, and that is what uinput allows. If a program opens /dev/uinput and issues some ioctl() commands, a new input device is created. Events will be reported on that device when they are written to the file descriptor opened on /dev/uinput.
User-space LEDs — how and why
The latest addition to the collection of downstream interfaces is conceptually similar to uinput but it allows the emulation of LED devices rather than input devices. To support this functionality, it introduces a new device called /dev/uleds. Opening this device and writing the name of a new device (zero-padded to 64 bytes) will create an LED device with the given name.
There is no option to configure any other aspects of the LED, but there is not much that could be configured anyway. An LED can generally indicate the number of brightness levels that it can support; LEDs created with uleds always support 256 brightness levels. Whenever the brightness is changed, a byte can be read which reports the new level. An LED can also indicate that it knows how to blink so that, when needed, it can be given a single "blink" request rather than periodic "on" and "off" requests. A uleds device cannot be used to experiment with this functionality, but it could undoubtedly be added later using an ioctl() if a need was found.
The particular need that is driving the development of this interface by David Lechner is the desire to make two embedded systems compatible with one another: "I would like to make a userspace program that works the same on both devices." If that program accesses an LED device directly, the device must appear to be present on both systems; where it isn't physically present it can now be emulated, possibly using a widget on a graphic display.
At much the same time Marcel Holtmann had been working on a similar interface to allow the testing of LED triggers from the Bluetooth subsystem. Various subsystems can be connected to an LED, using a trigger, to signal the current state of that subsystem. Without an LED device, it is hard to test those triggers. With the ability to emulate an LED device, that impediment to development need no longer exist.
The 4th revision of the user-space LEDs patch set was posted in mid-September and appears to have addressed all the issues that reviewers found. We can expect the code to land in mainline for Linux 4.10. It seems unlikely that this will be the last device type that someone will want to emulate. Some devices, such as power regulators, seem so intimately related to hardware that it is hard to imagine an emulator ever being wanted. Others, like maybe a GPIO, might usefully be provided with a downstream interface for emulation. Whether there turns out to be a genuine need for that is something we will have to wait to see.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet