
Leading items

Welcome to the LWN.net Weekly Edition for November 30, 2017

This edition contains the following feature content:

  • Python data classes: a look at PEP 557 and the proposed addition of data classes to the standard library.
  • Replacing x86 firmware with Linux and Go: Ronald Minnich's ELC Europe talk on the NERF project.
  • SPDX identifiers in the kernel: machine-readable license tags for kernel source files.
  • 4.15 Merge window part 1: a summary of the first set of changes pulled for 4.15.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Python data classes

By Jake Edge
November 29, 2017

The reminder that the feature freeze for Python 3.7 is coming up fairly soon (January 29) was met with a flurry of activity on the python-dev mailing list. Numerous Python enhancement proposals (PEPs) were updated or newly proposed; other features or changes have been discussed as well. One of the updated PEPs is proposing a new type of class, a "data class", to be added to the standard library. Data classes would serve much the same purpose as structures or records in other languages and would use the relatively new type annotations feature to support static type checking of the use of the classes.

PEP 557 ("Data Classes") came out of a discussion on the python-ideas mailing list back in May, but its roots go back much further than that. The attrs module, which is aimed at reducing the boilerplate code needed for Python classes, is a major influence on the design of data classes, though it goes much further than the PEP. attrs is not part of the standard library, but is available from the Python Package Index (PyPI); it has been around for a few years and is quite popular with many Python developers. The idea behind both attrs and data classes is to automatically generate many of the "dunder" methods (e.g. __init__(), __repr__()) needed, especially for a class that is largely meant to hold various typed data items.

Python's named tuples are another way to easily create a class with named data items, but they suffer from a number of shortcomings. For one, they are still tuples, so two named tuples with the same set of values will compare as equal even if they have different names for the "fields". In addition, they are always immutable (like tuples) and their values can be accessed by indexing (e.g. nt[2]), which can lead to confusion and bugs.
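A quick sketch of those pitfalls (the names and values here are purely illustrative):

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
Color = namedtuple('Color', ['red', 'green'])

p = Point(1, 2)
c = Color(1, 2)

# Two unrelated named tuples holding the same values compare equal:
assert p == c

# Values can also be reached by index, inviting positional bugs:
assert p[1] == p.y == 2

# And, like all tuples, instances are always immutable:
try:
    p.x = 10
except AttributeError:
    pass  # assignment to a namedtuple field is not allowed
```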

As the "Rationale" section of the PEP notes, there are various descriptions out there of ways to support data classes in Python, along with people posting questions about how to do that kind of thing. For many, attrs provides what they need (Twisted developer Glyph Lefkowitz championed the module in a blog post in 2016, for example), but it also provides more than some need. Beyond that, discussion in the GitHub repository for the data classes project indicated that attrs moves too quickly and has extra features that make it not suitable for standard library inclusion. Data classes are meant to be a simpler, standard way to have some of that functionality:

Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.

Eric V. Smith picked up the suggestion of writing a PEP from the python-ideas thread and posted the first version to python-dev back in September. The first set of comments was the inevitable bikeshedding over the name, which continued even after Guido van Rossum asked that it stop. Van Rossum is satisfied with the "data classes" name, though others prefer "record", "struct", and similar names. There were some more technical comments made in that thread, which Smith incorporated into the revision he posted about in late November.

The overall goal is to reduce the boilerplate that needs to be written for a class with typed data fields. To that end, there is an @dataclass decorator (in the dataclasses module) that processes the class definition to find typed fields. It then generates the various dunder methods and attaches them to the class, which is then returned by the decorator. It would look something like the following example from the PEP:

    @dataclass
    class InventoryItem:
	'''Class for keeping track of an item in inventory.'''
	name: str
	unit_price: float
	quantity_on_hand: int = 0

	def total_cost(self) -> float:
	    return self.unit_price * self.quantity_on_hand

The total_cost() method was put into the example to help show that a data class is simply a regular class and can have its own methods, be subclassed, and so on. From the above declaration, InventoryItem would automatically get a properly type-annotated __init__() method, along with a __repr__() that produces a descriptive string and a bunch of comparison operators (e.g. __eq__(), __lt__(), __ge__()). None of those need to be written or maintained by the developer.

More fine-grained control over the generated methods is available using parameters passed to the dataclass() decorator. There is a handful of boolean flags that determine whether certain methods are generated (init, repr, eq, compare); the latter two allow only generating equality methods (__eq__() and __ne__()) or generating the full set of comparison methods. These methods all test objects as if they were a tuple of the fields in the order specified in the class definition.
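As a sketch of the tuple-style comparisons (note that the flag the PEP draft calls compare was ultimately named order in the module that shipped with Python 3.7):

```python
from dataclasses import dataclass

# The PEP draft calls this flag "compare"; it was renamed "order"
# in the dataclasses module that shipped with Python 3.7.
@dataclass(order=True)
class Version:
    major: int
    minor: int

# Comparison behaves as if each object were the tuple (major, minor),
# in the order the fields were declared:
assert Version(3, 7) > Version(3, 6)
assert min(Version(3, 7), Version(2, 7)) == Version(2, 7)
```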

There was some discussion of how to handle comparisons between objects that have different types. Obviously, comparing unrelated objects should not succeed (the generated methods return the NotImplemented sentinel, which leads to a TypeError for the ordering operators), but for subclasses that don't add any fields, an argument could be made that the comparison should be done. Smith considered using an isinstance() check, but ended up taking the lead from attrs and sticking with strict type checks for all of the comparison operators. This GitHub issue has a bit more discussion, including that attrs is actually only strict for the equality operators—something attrs author Hynek Schlawack called an oversight.

There are two other flags for dataclass() that govern whether the class is "frozen" (emulating immutability by raising an exception when any field is assigned to) and whether a __hash__() method will be generated (thus allowing objects to be used as dictionary keys). The two are somewhat intertwined (and interact with the eq flag as well), so the flag interpretations reflect that:

If eq and frozen are both true, Data Classes will generate a __hash__ method for you. If eq is true and frozen is false, __hash__ will be set to None, marking it unhashable (which it is). If eq is false, __hash__ will be left untouched meaning the __hash__ method of the superclass will be used (if the superclass is object, this means it will fall back to id-based hashing).

Although not recommended, you can force Data Classes to create a __hash__ method with hash=True. This might be the case if your class is logically immutable but can nonetheless be mutated. This is a specialized use case and should be considered carefully.

That all seems a little clunky, but it is likely to be a fairly fringe feature that will not see much use.
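The common combination is straightforward, though; a sketch using the flags as they appear in the shipped module:

```python
from dataclasses import dataclass, FrozenInstanceError

# eq defaults to True; combined with frozen=True, a __hash__() is
# generated, so instances can serve as dictionary keys:
@dataclass(frozen=True)
class Point:
    x: int
    y: int

locations = {Point(0, 0): 'origin'}
assert locations[Point(0, 0)] == 'origin'

# Assigning to a field of a frozen instance raises an exception:
p = Point(1, 2)
try:
    p.x = 5
except FrozenInstanceError:
    pass  # frozen data classes emulate immutability
```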

Fields can be specified using the type annotation syntax (as in the example above), but more control is available using the field() function. That allows fields to be removed from the generated methods using the init, repr, compare, and hash flags. It also provides a way to set the default value, since using field() precludes the usual way to set a default, as an example in the PEP shows:

    @dataclass
    class C:
	x: int
	y: int = field(repr=False)
	z: int = field(repr=False, default=10)
	t: int = 20

Beyond that, there can be a default_factory passed to create new empty objects (e.g. dict, list) for the field, since using [] or {} directly would result in all objects sharing the same list or dictionary. There is also a metadata parameter that can set some user-specific metadata on the Field objects that are created for each field in a data class (and can be retrieved using the fields() function in dataclasses).
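A sketch of both features, with illustrative class and field names:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Order:
    customer: str
    # A bare [] default would be shared by every instance;
    # default_factory builds a fresh list per object instead:
    items: list = field(default_factory=list)
    # metadata is stashed on the Field object for user code to read:
    total: float = field(default=0.0, metadata={'unit': 'USD'})

a = Order('alice')
b = Order('bob')
a.items.append('widget')
assert b.items == []  # no shared state between instances

# fields() returns the Field objects, metadata included:
meta = {f.name: f.metadata for f in fields(Order)}
assert meta['total']['unit'] == 'USD'
```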

There are some other module-level helper functions, such as asdict() and astuple() to convert a data class to a dict or tuple; is_dataclass() allows checking whether an object is a data class instance. There is more to the data class specification, but the summary above hits most of the high notes.
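In the module as it shipped, these helpers are spelled asdict(), astuple(), and is_dataclass(); a quick sketch:

```python
from dataclasses import dataclass, asdict, astuple, is_dataclass

@dataclass
class C:
    x: int
    y: int

c = C(1, 2)
assert asdict(c) == {'x': 1, 'y': 2}   # field names become dict keys
assert astuple(c) == (1, 2)            # values in declaration order
assert is_dataclass(c) and not is_dataclass(42)
```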

So far, there have been few real objections to the idea. Given that Van Rossum has been actively participating in the threads (and suggested writing a PEP), it would seem highly likely that he will accept the PEP for 3.7. There is working code in the GitHub repository, so there should be little that stands in its way.

The process followed here is an excellent example of how Python development works. Something was posted to python-ideas that was not particularly "Pythonic", it was discussed and a path forward was identified, a PEP was written and has been reviewed by many, changes were made, and we are on the cusp of seeing it in a release. All of that took roughly half a year, though much of the groundwork was laid some time ago. Clearly not all features have such a smooth path—or even any path—into Python, but ideas whose time has come can be adopted fairly rapidly.

Comments (10 posted)

Replacing x86 firmware with Linux and Go

By Jake Edge
November 20, 2017

ELC Europe

The Intel Management Engine (ME), which is a separate processor and operating system running outside of user control on most x86 systems, has long been of concern to users who are security and privacy conscious. Google and others have been working on ways to eliminate as much of that functionality as possible (while still being able to boot and run the system). Ronald Minnich from Google came to Prague to talk about those efforts at the 2017 Embedded Linux Conference Europe.

He began by noting that most times he is talking about firmware, it is with his coreboot hat on. But he removed said "very nice hat", since his talk was "not a coreboot talk". He listed a number of people who had worked on the project to "replace your exploit-ridden firmware with a Linux kernel", including several from partner companies (Two Sigma, Cisco, and Horizon Computing) as well as several other Google employees.

The results they achieved were to drop the boot time on an Open Compute Project (OCP) node from eight minutes to 20 seconds. To his way of thinking, that is "maybe the single least important part" of this work, he said. All of the user-space parts of the boot process are written in Go; that includes everything in initramfs, including init. This brings Linux performance, reliability, and security to the boot process and they were able to eliminate all of the ME and UEFI post-boot activity from the boot process.

Describing the mess

The problem, Minnich said, is that Linux has lost its control of the hardware. Back in the 1990s, when many of us started working with Linux, it controlled everything in the x86 platform. But today there are at least two and a half kernels between Linux and the hardware. Those kernels are proprietary and, not surprisingly, exploit friendly. They run at a higher privilege level than Linux and can manipulate both the hardware and the operating system in various ways. Worse yet, exploits can be written into the flash of the system so that they persist and are difficult or impossible to remove—shredding the motherboard is likely the only way out.

He used to give a talk with the title: "If you trust your computer, you're crazy", due to all of that proprietary code running on our systems. He hopes that this talk will give folks ways to deal with some of those problems, "so we can stop being crazy and maybe get a little sane".

[x86 operating systems]

He showed one of his slides [PDF] (above) that described the seen and unseen operating systems running on an x86 system. Ring 0 is Linux and, because "we ran out of ring numbers", hypervisors like Xen are ring -1, but below that are rings that are running code that you don't have access to, sometimes on processors you don't even know are part of the system. Ring -2 has a kernel and a half kernel; it consists of UEFI, which is the full kernel, and system management mode (SMM), which traps to 8086 16-bit mode, thus the "half" designation. Those control everything about the CPU and are invisible to the rings above. Every time you close the lid of your laptop, or do certain other things, SMM traps to classic 8086 mode; "that should make you happy", he said sarcastically.

Ring -3 is "the one that has people really worried". It runs MINIX 3 and is where the ME runs. It is the cause of the "year of MINIX 3 on the desktop", he joked, since there are more systems with the ME than any of Linux, macOS, or Windows.

There is no common code between the systems running in ring -2 and ring -3 as far as he knows, but they both have a wide range of capabilities. Both have IPv4 and IPv6 networking stacks, filesystems, drivers for various devices (disk, network, USB, ...), and web servers. The ME needs filesystems because it can be used to reimage the system; in fact, Minnich said, it can reimage the system even if the power is turned off as long as it is plugged into the wall and the network.

There is a whole raft of components that make up the ring -3 ME, many of which he does not understand. For example there are components named "full network manageability", "regular network manageability", and "manageability", as well as the "outbreak containment heuristic". He pointed to a Master's thesis [large PDF] from Vassilios Ververis about ten years ago that looked at many different flaws in the ME. It is rather depressing, Minnich said, since it showed that almost every part of the ME could be attacked; some of those bugs still have not been fixed.

[Ronald Minnich]

He referenced the headline of a Wired article about an ME exploit ("Intel Fixes a Critical Bug That Lingered for 7 Dang Years") that he thought was funny. Less funny was the bug itself that allowed a zero-length password to be sent to the web server to give administrator access to systems with the ME. Since the bug was present for seven years, that adds up to around a billion systems, he said, and he strongly doubts that all of those have been patched with a firmware update.

He moved on to the half OS in ring -2. SMM was originally meant to handle power management on DOS systems; it can take over the system out from under ring 0 when certain events (system management interrupts or SMIs) occur. There are a lot of SMI exploits and, once SMM is enabled, it cannot be turned off. It takes 8MB of memory away from the rest of the system for its purposes. SMM is "a good way to maintain vendor control over you", he said.

The other thing running in ring -2 is UEFI; both it and SMM run on the main CPU. UEFI is "an extremely complex kernel"; vendors are writing code for the kernel, but they don't understand all of the rules, so they make mistakes. The result is that "there are big, giant holes that people can drive exploits through". The UEFI security model, as far as he can see, is obscurity.

There are tons of exploits for UEFI, he said. Because UEFI is updated by handing off bits of UEFI code to the UEFI kernel, he is worried that exploits will persistently infect that process, such that it will claim to update itself, but not do so. That only leaves the shredder.

He summarized by reiterating what he had just described: 2.5 hidden OSes with network stacks, web servers, and other capabilities. These OSes have bugs that can persist across power cycles and reinstalls and those bugs have been exploited in the past. His old talk used to end here with a question: "Are you scared yet?"

Fixing the mess

"So how do we fix this mess?", he asked. Some people say to switch to AMD processors, but that is not really a solution now. Ryzen is touted to be open, but that is not truly the case; there are still closed parts. So the project is focusing on Intel x86 processors and has the goal of reducing the scope of the 2.5 OSes. The project is called "non-extensible reduced firmware" (NERF), partly because the team believes the "extensible" in UEFI is harmful. Apparently, there is no overall web page for NERF itself, though some of the components Minnich talks about do have web pages. [Update: As noted in the comments, there is a NERF web page.]

The idea behind NERF is to reduce the harm that the firmware is capable of. In addition, there is an effort to make what the firmware is doing more visible. It does this by removing almost all of the runtime components from the firmware; the "almost" refers to the ME, which is hard to kill completely, he said. If you completely remove the ME, your node probably will not boot, but NERF has taken away the ME's web server and IP stacks. The UEFI IP stack and other drivers have also been removed. Beyond that, the self-reflash capability for ME and UEFI has been removed, so Linux manages all flash updates.

The NERF components are a de-blobbed ME ROM and a UEFI ROM that has been reduced to its most basic parts; in addition, SMM has been disabled or vectored to Linux where that is needed. On top of that runs a Linux kernel with a Go-based user space (u-root). He noted that the project is particularly interested in any Go programmers who want to contribute to just that piece.

They would prefer to remove the ME entirely, but that simply is not an option. If you remove it, the system may not boot or power on at all; if it does power on, it may shut down again after 30 minutes. But there is some good news: the ME has multiple components and most of them can be removed.

He pointed to the me_cleaner project that will process an ME ROM to remove most of it. For example, on the MinnowMax, 5MB of the 8MB flash was used for the ME, but that was reduced to 300KB by me_cleaner. So you only need 300KB of the ME to boot Linux and that gets rid of all of the stuff you really don't want the ME to be doing anyway. The ME reduction is working for MinnowMax and a number of other boards, he said.

If you "get into the game early enough", and they believe their Linux kernel does, SMM can be completely disabled. As far as they can tell, there is no requirement to run SMM; it is mostly there for "value add" by the vendors, which is just a way to try to lock people into their platform. If it ever becomes an issue for some hardware, though, there are ways to vector the SMIs to the kernel. The theme is to keep Linux in control, he said.

UEFI is "huge and extremely complex", but there are a lot of mistakes made in the implementation of it. Some interrupts, including memory-error-detection interrupts, still need to be routed to UEFI, though. They want to remove the opportunities for UEFI drivers to put in exploits by making it non-extensible. He showed "an eye chart" of all of the different services that UEFI provides; he noted that it looks like a kernel, because it is, and said that "it is a sizable fraction of the size of Linux".

Next up, he showed the standard UEFI boot process; it starts with two phases (security or SEC and pre-EFI initialization or PEI) that are completely proprietary and will never be released by the vendors, he said. Beyond that, though, the next phase, which is called the driver execution environment (DXE), has a well-defined interface that multiple components (DXE core, drivers, boot manager, ...) conform to.

The boot manager is responsible for starting up the operating system. When you see the screen that allows choosing what to boot on a UEFI system, that is the boot manager. What they have done is to replace the boot manager with a Linux kernel that conforms to the DXE interface. On the OCP node system that was being demonstrated elsewhere at the conference, booting the Linux kernel took 20 seconds from power-on; the Go-based user space does a DHCP query, a wget for the server kernel, and then a kexec into the new kernel, which takes an additional three seconds. There are plans to replace the DXE core component with something that is open source and knows more about how to boot Linux; that should reduce the boot time even further, Minnich said.

He does not believe that we will ever get access to that early boot code (SEC and PEI) for UEFI. Even for Chromebooks running coreboot, that piece is a binary blob. The best we can do, he said, is to replace the pieces at that well-defined interface, which is what has been done. In addition, the goal was to get rid of all the UEFI runtime services, which has been accomplished.

As part of his Heads project, Trammell Hudson has put together some Makefiles and the like to create a NERF image. That can be used with a custom kernel and initramfs to replace as much of UEFI as possible. They have had good results on a Dell server, the MinnowMax, and the OCP nodes.

Using Linux makes the firmware easier to work with, Minnich said. Normally, there are lots of fiddly, hardware-specific pieces that need to be changed in the firmware, but using the DXE interface makes a lot of those problems go away. He expected that different kernels would be required for the different systems, but he has been using the same kernel on the MinnowMax, which is a small system, and the OCP node, which is a rather large system.

The user-space piece is all written in Go, which is generally more trusted than C within Google, he said. The 5.9MB initramfs contains all of the source code for the user space, all of the Go compiler and package sources, and a Go toolchain. The commands are built on the fly, as they are needed, which usually takes around 200ms per command; once they are built, it is "nearly instantaneous" (1ms) for them to run. From a security angle, that's good because all of the source is available to be examined.

For cases where there is not sufficient space for an initramfs of that size or enough CPU power to do even a fast compile step on the way to booting, there is another mode for the u-root Go commands. It is like BusyBox, in that there is one binary that is linked to a bunch of different command names; this mode uses the Go abstract syntax tree package to rewrite the commands as packages. That reduces the footprint to 2MB, which is useful on systems with less flash space.

There are some implications of the u-root work that have Minnich thinking about booting for desktop systems. With u-root, there are no scripts or unit files to deal with; there is simply a single program that boots the system, which leads to "things coming up really fast". It is more understandable for him and makes the boot process faster. There is a project at Google, called NiChrome, that can bring up a Chromebook all the way from power-up to X11 and a browser in five seconds.

Go is a compiled language, but it is often used for scripting. Minnich uses it that way "all the time"; he stopped writing Bash scripts years ago in favor of Go. It is "easier and more reliable" to write scripts in Go.

He concluded by saying that he is hoping to see companies ship hardware with NERF and u-root in 2018. Companies want to have firmware that they understand, he said; they also want it to boot quickly and be secure. In the Q&A, Minnich was asked about secure boot and TPMs. Neither is supported currently, though there is a non-working verified boot program in u-root at this point. For TPM support, he thinks the project will follow what Chrome OS has done, rather than take the secure boot path.

He was also asked about the relationship of this work to coreboot. Minnich said that coreboot should always be preferred, but it has not been available for server platforms for 12 years. So he would suggest that developers "always use coreboot if you can", but if not, look at NERF.

Those interested can view the YouTube video of Minnich's talk.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for supporting my travel to Prague for ELC Europe.]

Comments (50 posted)

SPDX identifiers in the kernel

By Jonathan Corbet
November 16, 2017
Observers of the kernel's commit stream or mailing lists will have seen a certain amount of traffic referring to the addition of SPDX license identifiers to kernel source files. For many, this may be their first encounter with SPDX. But the SPDX effort has been going on for some years; this article describes SPDX, along with why and how the kernel community intends to use it.

On its face, compliance with licenses like the GPL seems like a straightforward task. But it quickly becomes complicated for a company that is shipping a wide range of software, in various versions, in a whole set of different products. Compliance problems often come about not because a given company wants to flout a license, but instead because that company has lost track of which licenses it needs to comply with and for which versions of which software. SPDX has its roots in an effort that began in 2009 to help companies get a handle on what their compliance obligations actually are.

It can be surprisingly hard to determine which licenses apply to a given repository full of software. The kernel's COPYING file states that it can be distributed under the terms of version 2 of the GNU General Public License. But many of the source files within the kernel tell a different story; some are BSD licensed, and many are dual-licensed. Some carry an exception to make it clear that user-space programs are not a derived product of the kernel. Occasionally, files with GPL-incompatible licenses have been found (and fixed).

A great many files in the kernel source tree carry no license text at all. One might presume that these files are covered by GPLv2 but, as we'll see, the situation may not be quite that simple. No-license files are also problematic because the Developer Certificate of Origin, which governs contributions to the kernel, refers explicitly to "the open source license indicated in the file". If there is no license indicated in the file, the meaning of that phrase is not entirely clear.

Another complicating factor is that the license text in kernel source files, when it is present at all, is entirely free-form. There are hundreds of variants of the GPLv2 text alone. That can make it hard for human readers to figure out what's going on, but it is even more challenging for software. It is not currently possible to run a tool on the kernel repository (or that of many other projects) and get a definitive list of the operative licenses.

The Software Package Data Exchange (SPDX) standard is an attempt to address this aspect of the licensing problem. This effort, which has come under the umbrella of the Linux Foundation's compliance program, has defined a way to declare licensing information that is intended to be easily read by both humans and machines. At its core, SPDX defines a single-line string to specify the license governing a file. It looks something like:

    SPDX-License-Identifier: GPL-2.0

There is a long list of known licenses and the ability to add extra conditions or exceptions where needed. If each file in a repository contains one of these strings, summing up the licensing information for the repository as a whole becomes a straightforward affair.
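As a sketch of why the single-line form helps tooling, a few lines of Python can tally identifiers across a source tree (this is an illustration, not the kernel's actual tooling):

```python
import re
from collections import Counter
from pathlib import Path

# The tag is a fixed single-line string, so a simple pattern finds it;
# [^*\n]+ stops before a closing */ in header-style comments.
SPDX_RE = re.compile(r'SPDX-License-Identifier:\s*([^*\n]+)')

def tally_licenses(root):
    """Count SPDX identifiers across .c and .h files under root."""
    counts = Counter()
    for path in Path(root).rglob('*.[ch]'):
        head = path.read_text(errors='replace')[:1024]
        m = SPDX_RE.search(head)
        counts[m.group(1).strip() if m else '(no SPDX tag)'] += 1
    return counts
```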

SPDX has been adopted in various parts of the industry in recent years. The effort to add SPDX identifiers to the kernel has been playing out, mostly in private, for at least a couple of years. It recently surfaced in the form of a huge patch set adding SPDX identifiers to over 12,000 kernel source files that did not have any license information at all, and as a brief discussion at the 2017 Maintainers Summit. Somewhat later, some documentation on the project surfaced. Fully documenting the kernel with SPDX tags will take a while, but the process is well underway at this point.

For kernel source files, the decision was made that the SPDX tag should appear as the first line in the file (or the second line for scripts where the first line must be the #! string). For normal C source files, the string will be a comment using the "//" syntax; header files, instead, use traditional (/* */) comments for reasons related to tooling. Thus, for example, if one looks at arch/alpha/include/uapi/asm/a.out.h, one will see at the top:

    /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */

The WITH string says that the kernel's user-space exception applies to this file, since it defines part of the system-call ABI.

Kernel developers are often short of patience for things that look like bureaucratic exercises, so it would not have been surprising to see some opposition to this project. In truth, there has been little. The biggest issue would appear to be that some of the no-license files that were marked as GPLv2 should maybe carry a different license. One could argue that this kind of disagreement is a good thing, in that it points out a place where the license applying to a specific file was not what most people might expect. Once this kind of problem comes to light, it can be addressed.

The plan is to eventually have SPDX tags in all kernel source files, but that process could take some time. For each file that already carries a license text, somebody has to look and ensure that the SPDX tag matches that text exactly. Given that there are around 60,000 files in the kernel repository, that's a fair amount of work. An additional goal is to eventually get rid of the other license texts; the consensus seems to be that the SPDX identifier is a sufficient declaration of the license on its own. But removing license text from source files must be done with a great deal of care, so it may be a long time before anybody works up the courage to attempt that on any files that they do not themselves own the copyright for.

It would not be surprising to see the process of adding SPDX tags extend over years. There will likely be an occasional flare-up as this work uncovers files with ambiguous or uncertain licensing, but that should result in more clarity around the licensing of the kernel as a whole once things are worked out. At the end, perhaps we'll know what the kernel's license story really is.

Comments (27 posted)

4.15 Merge window part 1

By Jonathan Corbet
November 17, 2017
When he released 4.14, Linus Torvalds warned that the 4.15 merge window might be shorter than usual due to the US Thanksgiving holiday. Subsystem maintainers would appear to have heard him; as of this writing, over 8,800 non-merge changesets have been pulled into the mainline since the opening of the 4.15 merge window. Read on for a summary of the most interesting changes found in that first set of patches.

Core kernel

  • The control-group v2 subsystem finally has a CPU controller, bringing a long story to a happy ending.
  • The live-patching mechanism has seen a couple of significant improvements. The "shadow variables" mechanism allows the addition of data to structures; it will be used in patches that make data-structure modifications. There is also a new callback mechanism that can invoke kernel code when an object is patched, extending the ability to apply live patches affecting tricky areas like global data or assembly code.

Architecture-specific

  • The openrisc architecture has gained support for SMP systems.
  • The RISC-V architecture is now supported — sort of. "The port is definitely a work in progress. While what's there builds and boots with 4.14, it's a bit hard to actually see anything happen because there are no device drivers yet".
  • AMD's secure encrypted virtualization feature is now supported. This feature, which builds on the secure memory encryption work merged in 4.14, allows virtual machines to run with memory that is encrypted and unreadable by other virtual machines or the host system.
  • Intel's user-mode instruction prevention (UMIP) feature, which disables user-mode access to specific security-relevant instructions, is supported. The feature is disabled by default because it breaks some applications (Wine, for example), but the plan is to address these problems during this development cycle.
  • The arm64 architecture has gained support for the scalable vector extension mechanism.

Filesystems/block layer

  • The Smack security module is now able to work with the overlayfs union filesystem.
  • The XFS filesystem has gained initial support for online filesystem checking. This feature is incomplete and is not yet intended for production use.
  • The NVMe block driver has gained native multipath support, enabling high-performance concurrent I/O on high-end systems.

Networking

  • The networking layer now supports the "ThunderboltIP" protocol for passing IP packets over a Thunderbolt cable.
  • Support for SCTP stream schedulers has been added. Three schedulers (FCFS, priority, and round-robin) have been merged.
  • Most TCP-related sysctl knobs have been made aware of network namespaces.
  • The network queueing discipline subsystem now has a "credit-based shaper" module. Such documentation as exists can be found in this commit.

BPF

  • The user-space bpftool utility can be used to examine and manipulate BPF programs and maps; see this man page for more information.
  • Hooks have been added to allow security modules to control access to BPF objects; see this changelog for more information.
  • A new BPF-based device controller has been added; it uses the version-2 control-group interface. Documentation for this feature is entirely absent, but one can look at the sample program added in this commit that uses it.

Hardware support

  • GPIO: Maxim MAX3191x industrial serializers, UniPhier GPIO controllers, and NVIDIA Tegra186 GPIO controllers.
  • Graphics: Samsung S6E63J0X03 DSI command mode panels, Orise Technology otm8009a 480x800 dsi 2dl panels, Seiko 43WVF1G panels, Faraday TVE200 TV encoders, Rockchip LVDS controllers, Silicon Image SiI9234 HDMI/MHL bridges, and Raspberry Pi 7-inch touchscreen panels.
  • Industrial I/O: Maxim Integrated DS4422/DS4424 DACs, RF Digital RFD77402 time-of-flight sensors, and Texas Instruments 8/10/12-bit 2/4-channel DACs.
  • Input: EETI EXC3000 multi-touch panels, HiDeep touchscreens, and Samsung S6SY761 touchscreen controllers.
  • Media: Sigma Designs SMP86xx IR decoders, Rockchip Raster 2d graphic acceleration units, Sony IMX274 sensors, and Tegra HDMI CEC interfaces.
  • Miscellaneous: Maxim MAX6621 temperature sensors, Maxim MAX31785 fan controllers, TI SDHCI controllers, Amlogic Meson6/Meson8/Meson8b SD/MMC host controllers, Amlogic Meson GPIO interrupt multiplexers, Socionext external interrupt units, STMicroelectronics STM32 DMA multiplexers, STMicroelectronics STM32 master DMA controllers, Spreadtrum DMA controllers, PC Engines APU/APU2 LED controllers, HiSilicon STB PCIe host bridges, V3 Semiconductor PCI controllers, Intel Cherry Trail Dollar Cove TI power-management ICs, Spreadtrum SC27xx power-management ICs, and Texas Instruments DP83822 network PHYs.
  • USB: TI TPS6598x USB power delivery controllers and Broadcom STB USB PHYs.
  • The legacy Open Sound System audio drivers have been disabled since 4.12; as of 4.15, they have been removed entirely.
  • The new LED activity trigger mechanism can use an attached LED to indicate the level of CPU activity in the system.

Internal kernel changes

  • There are a couple of new helper scripts for people working on the documentation. find-unused-docs.sh will look for kerneldoc comments to exported functions that are not actually used in the formatted documentation. documentation-file-ref-check can be used to find references to nonexistent files in the documentation.
  • The regmap framework now has support for using hardware spinlocks to control access to registers.
  • The s390 architecture has gained alternatives support, allowing the kernel to patch itself at boot time to use newer instructions when they are available.
  • The lockdep crossrelease mechanism was disabled in 4.14 due to various problems; those have been fixed and crossrelease is available once again in 4.15.
  • The new down_read_killable() helper will attempt to take a reader/writer semaphore for read access while keeping the process killable by user space.
  • Work toward getting rid of ACCESS_ONCE() continues; code should use READ_ONCE() or WRITE_ONCE() instead.
  • There is a new timer function:

        int timer_reduce(struct timer_list *timer, unsigned long expires);
    

    It will (1) start the timer if it is not currently running, and (2) set the expiration to expires if expires is sooner than the current value.

  • The kmemcheck memory-usage debugging tool has been removed from the kernel; it has been superseded by tools like KASAN.
  • The __GFP_COLD memory-allocation flag, used to request a cache-cold page, has been removed. It wasn't properly implemented anyway, and the benefits from using it were far from clear.

Conclusion

Additionally, of the 8,861 changesets merged so far, 300 mention timer_setup(), making them part of the ongoing timer API change. There are also 57 patches adding SPDX identifiers.

By the normal schedule, the 4.15 merge window would end on November 26, with the final 4.15 release happening in mid-January. But, as mentioned above, the Thanksgiving holiday could change things, causing the merge window to be either shorter or longer than usual. However it plays out, LWN will run a followup article covering the rest of this merge window.

Comments (none posted)

4.15 Merge window part 2

By Jonathan Corbet
November 28, 2017
Despite the warnings that the 4.15 merge window could be either longer or shorter than usual, the 4.15-rc1 prepatch came out right on schedule on November 26. Anybody who was expecting a quiet development cycle this time around is in for a surprise, though; 12,599 non-merge changesets were pulled into the mainline during the 4.15 merge window, 1,000 more than were seen in the 4.14 merge window. The first 8,800 of those changes were covered in this summary; what follows is a look at what came after.

Core kernel

  • User namespaces have, thus far, only supported five UID or GID mappings. With 4.15, that limit has been raised to 340.
  • The MAP_SYNC mechanism has been added to allow user-space applications to take control of cache flushing for nonvolatile memory arrays. It works by forcing a metadata flush on the relevant file before allowing a write fault to succeed, thus ensuring that the application's view of the file layout is consistent with the kernel's view.
  • The cramfs compressed filesystem has seen some significant changes. It can now handle filesystems mapped directly into memory (in persistent memory, for example); this feature, when combined with uncompressed regions, allows execute-in-place support.

Architecture-specific

  • The SPARC architecture has gained support for virtual dynamic shared objects (vDSO) exported by the kernel.

Filesystems/block layer

  • The AFS filesystem has seen a great deal of work. It now supports network namespaces (partially; this work is not yet complete), writable mmap() areas are supported, and more; see this merge commit for more information. Note that AFS no longer supports pre-3.4 servers, so users who have not upgraded since 1998 will have trouble with 4.15.
  • The f2fs filesystem has improved quota support, a feature that will evidently be used by Android.

Hardware support

  • Clock: R-Car V3M clocks, Mediatek MT2712 and MT7622 clocks, NXP PCF85363 realtime clocks, and Spreadtrum SC27xx realtime clocks.
  • Graphics: The AMD Display Core subsystem, which ran into trouble in late 2016, has been merged for 4.15 after some significant changes. There is still work to do, but it has been concluded that this work is best done in-tree; see this merge commit for the story. This patch series contained over 1,100 changesets and added 132,000 lines of code to the kernel.
  • Miscellaneous: Intel Cedar Fork pin controllers, Texas Instruments interconnect target modules, NVIDIA Tegra BPMP thermal sensors, Technologic Systems NBUS controllers, Broadcom STB AVS TMON thermal subsystems, and MicroSemi Switchtec non-transparent bridges.

Internal kernel changes

  • The tracing subsystem can now trace module initialization functions. It is also now possible to trace the disabling and enabling of both preemption and interrupts.
  • Warnings generated by WARN_ONCE() are normally only printed once during the life of the system. The new debugfs file /sys/kernel/debug/clear_warn_once can be used to reset those warnings; writing "1" to that file will do the trick.
  • The kernel build subsystem has gained the ability to cache the results of a number of shell operations (those used to set internal variables, for example). The result should be faster kernel builds.
  • The clock provider subsystem has gained runtime power-management support.
  • The huge timer API transition has completed, and the old init_timer() function has been removed.

The 4.15 feature set is now mostly complete, though the possibility of a late pull or two was mentioned in the 4.15-rc1 announcement. If the usual schedule holds, the final 4.15 kernel can be expected on January 14 or 21. Before then, though, there is a lot of testing and fixing to be done.

Comments (2 posted)

BPF-based error injection for the kernel

By Jonathan Corbet
November 29, 2017
Diligent developers do their best to anticipate things that can go wrong and write appropriate error-handling code. Unfortunately, error-handling code is especially hard to test and, as a result, often goes untested; the code meant to deal with errors, in other words, is likely to contain errors itself. One way of finding those bugs is to inject errors into a running system and watch how it responds; the kernel may soon have a new mechanism for doing this sort of injection.

As an example of error handling in the kernel, consider memory allocations. There are few tasks that can be performed in kernel space without allocating memory to work with. Memory allocation operations can fail (in theory, at least), so any code that contains a call to a function like kmalloc() must check the returned pointer and do the right thing if the requested memory was not actually allocated. But kmalloc() almost never fails in a running kernel, so testing the failure-handling paths is hard. It is probably fair to say that a large percentage of allocation-failure paths in the kernel have never been executed; some of those are certainly wrong.

The kernel gained a fault-injection framework back in 2006; it can be used to test error-handling paths by causing memory allocation requests to fail. Just making kmalloc() fail universally is unlikely to be helpful, though; execution will almost certainly never make it to the code that the developer actually wants to test. The fault-injection framework has some parameters to control which allocation attempts should fail, but the mechanism is somewhat awkward to use and is not as flexible as one might like. So the number of developers actually using this framework is small.

Fully generalizing fault injection would be a lot of work. A developer may want to see what happens when a specific kmalloc() call fails, but perhaps only when it is invoked from a specific call path or when some other condition is true. It has not been possible in the past to describe these conditions to the framework but, in recent years, a new technology has come along that can provide the required flexibility: the BPF virtual machine.

It is already possible to attach a BPF program to an arbitrary function using the kprobe mechanism. Such programs are useful for information gathering, but they cannot be used to affect the execution of the function they are attached to. Thus, they are not usable for error injection. That situation changes, though, with this patch set from Josef Bacik, which is intended to turn BPF into a generalized mechanism for the injection of errors into a running kernel.

The core of the new mechanism is a BPF-callable function called bpf_override_return(). If a BPF program attached to a kprobe calls this function, the execution of the function the program is attached to will be shorted out and its return value will be replaced with a value supplied by that BPF program. The patch set contains an example in the form of a test program:

    SEC("kprobe/open_ctree")
    int bpf_prog1(struct pt_regs *ctx)
    {
        unsigned long rc = -12;

        bpf_override_return(ctx, rc);
        return 0;
    }

This function can be compiled to BPF using the LLVM compiler. The SEC() directive at the top specifies that this function should be attached to a kprobe placed at the beginning of open_ctree(), a function in the Btrfs filesystem implementation. After the placement of this probe and the attachment of the BPF function, a call to open_ctree() will be overridden and the value -12 (-ENOMEM) will be returned. This is a relatively simplistic example, of course; it is expected that many uses will require more sophisticated BPF programs to narrow down the set of situations where the injection will occur.

This patch set had been through several revisions and appeared ready for inclusion into the mainline; it had even been applied to the networking tree for the 4.15 merge window. Things came to a halt, though, when Ingo Molnar blocked the progress of this patch set out of worries that it violated one of the basic promises behind the BPF virtual machine and could destabilize the kernel:

One of the major advantages of having an in-kernel BPF sandbox is to never crash the kernel - and allowing BPF programs to just randomly modify the return value of kernel functions sounds immensely broken to me.

After some discussion, a solution was agreed to: BPF programs would retain the ability to override kernel functions, but only for functions that have been specifically marked to allow this to happen. A new macro called BPF_ALLOW_ERROR_INJECTION() was introduced; it can be used to add the required annotation to a function. See, for example, this patch adding the marking for open_ctree(). Molnar suggested some additional conditions — only functions whose return value cannot crash the kernel should be annotated, and the override function should only change integer error values — but nothing enforces those rules in the current patch set.

Bacik's patch set only marks that one function; it is not clear whether those markings will be added in any quantity to the mainline kernel, or whether they will, instead, be maintained as private patches by the developers who use them. One can imagine that there could be some resistance to marking up the mainline in this way. But, on the other hand, there would be value in marking functions like kmalloc() to enable the development of generic tools that can be used to test specific allocation-error handling paths.

That question is only likely to be resolved once the mechanism is in place and patches marking functions for error injection start to appear. Meanwhile, the objections to the core mechanism have been addressed, and its path into the mainline appears to be clear. It has missed the 4.15 merge window, though, so it will almost certainly have to wait until 4.16.

Comments (15 posted)

Tools for porting drivers

By Jake Edge
November 27, 2017

OSS Europe

Out-of-tree drivers are a maintenance headache, since customers may want to use them in newer kernels. But even those drivers that get merged into the mainline may need to be backported at times. Coccinelle developer Julia Lawall introduced the audience at Open Source Summit Europe to some new tools that can help make both forward-porting and backporting drivers easier.

She opened her talk by noting that she was presenting step one in her plans; she hoped to be able to report on step two sometime next year. The problem she is trying to address is that the Linux kernel keeps moving on. A vendor might create a driver for the 4.4 kernel but, over the next six months, the kernel will have moved ahead by another two versions. There are lots of changes with each new kernel, including API changes that require driver changes to keep up.

That means that vendors need to continually do maintenance on their drivers unless they get them upstream, where they will get forward-ported by the community. But the reverse problem is there as well: once a device becomes popular, customers may start asking for it to run with older kernels too. That means backporting.

[Julia Lawall]

There is an obvious methodology to the porting process, Lawall said. Compile the driver with the target kernel and see what breaks. The compiler will likely complain about various things that point to where things are broken. Since the driver is unchanged, any breakage that is reported is from a place where the kernel has changed; GCC will point toward the parts of the driver that need to be fixed.

In order to figure out what needs to be done, one can look at other commits in the code history that update the code for those kinds of changes. If the driver is for a webcam, for example, it is probably not the only such driver that has needed to be updated for these changes. To fix the target driver, one can look at other webcam drivers to see what they did, then apply those changes to the target.

The assumption here is that if the driver compiles with the kernel, then it works. That is not always the case, Lawall said, but getting it to compile is progress toward getting it to work.

She showed an example (slides [PDF]) of the lms501kf03 TFT LCD panel driver. It was introduced in Linux 3.9 and she showed how this process works by trying to port it to the 4.6 kernel. First she compiled it with 4.6, which led to a number of GCC complaints. She noted that only two of the errors were significant; those related to two unknown fields that were being initialized in the driver (suspend and resume in struct spi_driver).

She looked for other commits that had removed those two fields and found a commit in the as3935 driver that did so. Looking at the patch, she noted that those functions had moved into a new spi_driver field called pm (which is in the driver sub-structure that is part of spi_driver). She further noted that the functions assigned to suspend and resume had changed their signature; they now took different parameters. She went through the process of seeing how the as3935 driver had changed to accommodate the underlying kernel changes.

Those changes that were made to the as3935 driver can be considered to see if they can be made to the lms501kf03 driver. If they can, then it is fairly straightforward to remove those fields, change the functions, and populate the structure properly for the 4.6 kernel. That is the methodology that she is trying to support with tools.

There are a number of challenges to overcome to get there. The error messages from GCC are redundant and inconsistent. Problems earlier in the code may lead to strange syntax errors being reported that muddy the waters, for example. The significant errors can end up requiring multiple changes in the code (as shown by the example). In addition, it may require finding multiple different example commits to fix the problems found. And, in finding those commits, it is important to choose the right ones; using git log with the -G or -S options will bring up a large number of commits that are not relevant. There are lots of structures in the kernel with suspend and resume fields, but the ones of interest for the example are only those in spi_driver structures.

So how can these examples be found? Her proposed solution has two pieces (as yet): a tool to reduce the noise from GCC and a tool to find example commits of interest. The first is called gcc-reduce; it takes the output of GCC, removes the redundant error messages and collects complementary information for those errors from the source code (e.g. the type of a structure that is missing a field). It would be easier to do this work for LLVM, she said, but kernel development is focused on GCC, so that's where the effort has gone.

The second piece is called Prequel and is used to find the commits that provide example solutions to porting issues. In the future, she would like to add a way to auto-generate patches based on the examples found. Prequel, which is related to Coccinelle, uses a notation that will be familiar to Linux developers and others who have used Coccinelle. It effectively incorporates diff-like notation into Coccinelle rules, so a query for patches that remove the suspend field from an spi_driver structure would be as follows:

    @@
    identifier i;
    expression e;
    @@

    struct spi_driver i = {
    - .suspend = e,
    };

Unlike Coccinelle, though, Prequel is simply doing pattern matching, not transformation. The query can be fed to Prequel along with a range of kernel commits and it will produce a list of all of the commits that match the pattern. It will list the commit IDs along with a percentage that indicates how much of the commit was matched by the rule. It can produce that as a text file, but there is also integration with Emacs Org mode and with Vim.

There are some 70,000 kernel commits each year, so it is not practical to extract and match on each one for each query. In order to have reasonable performance, indexes have been created of "words" on and near changed lines.

In general, learning the notation is not needed for driver porting as there are query templates that suffice for the typical cases. There are generally just a small set of porting issues that need to be addressed. Those include undefined variables, functions, fields, or types; the wrong number of arguments to a function or macro; and incompatible types in assignment, initialization, or as function arguments.

In order to evaluate the tools, drivers were ported both forward and back to and from the 4.6 kernel. The forward ports were from 2013 (13 drivers) and from 2015 (ten drivers), while ten backports from 4.6 to the original 2013 version were done. In all of those, 107 issues were reported by gcc-reduce and the Prequel templates were used to find the right changes for 80 of them. For 86% of those 80 issues, the first commit returned by Prequel contained the information of interest.

There are some limitations to the approach. It assumes that each issue is addressed in a single commit, which was not the case for five issues that were found in the experiment. For example, a field may get renamed temporarily (e.g. x to x_new); Prequel finds the first change, but not when x_new gets renamed back to x. Iteration is a possible solution to that problem. In some cases, gcc-reduce can misidentify the root cause of a problem, which happened for six of the issues in the evaluation.

Backporting is typically harder than forward porting, she said. Some of the commits in the history may actually be going from good code to good code (e.g. a cleanup of some kind) or just be fixing a bug. Those can confuse the queries, so there may need to be some more work done there.

The overall project web page has information on the tools and experiments. There are plans to improve the commit-ranking algorithm and to better infer the changes from the commits that are identified. In the Q&A, one audience member wondered if looking at a patch series, rather than an individual patch, might help identify the changes needed. Lawall agreed that it probably would, but that information is not carried in the Git commit stream; it could potentially be synthesized using the author and date information.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for supporting my travel to Prague for OSS Europe.]

Comments (7 posted)

Page editor: Jonathan Corbet


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds