Kernel development

Brief items

Kernel release status

The current development kernel is 4.6-rc2, released on April 3. Linus said: "You all know the drill by now - another week, another rc. I'd say that things look fairly normal at this point: it's not a big rc2, but that's been true lately (rc3 tends to be a bit bigger - probably just because it takes time for people to start noticing issues)."

Stable updates: none have been released since March 16.

Quote of the week

Because the *last* thing you want is programmers saying "I'm so important that I want the special futex". Because every single programmer thinks they are special and that _their_ code is special. I know - because I'm special.
Linus Torvalds

QOTW2: Busted

No I am not Nick Krause. I am just aware of how he got banned a few years ago. That email was a mistake by typo and was hoping nobody picked it up as they would then believe I was Nick Krause.
"Bastien Philbert"

    Received: from [192.168.0.11]
        (CPEbc4dfb2691f3-CMbc4dfb2691f0.cpe.net.cable.rogers.com. [99.231.110.121])
        by smtp.gmail.com with ESMTPSA id w69sm1687054qhw.3.2016.04.06.10.23.24
    	(version=TLSv1/SSLv3 cipher=OTHER); Wed, 06 Apr 2016 10:23:25 -0700 (PDT)
— from the message headers

    Received: from [192.168.0.11]
 	(CPEbc4dfb2691f3-CMbc4dfb2691f0.cpe.net.cable.rogers.com. [99.231.110.121])
	 by smtp.googlemail.com with ESMTPSA id o201sm11982708ioe.15.2016.02.22.12.12.53
	 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 22 Feb 2016 12:12:53 -0800 (PST)
— from an earlier message from Nick

Kernel development news

Analyzing the patchiness of vendor kernels

By Nathan Willis
April 6, 2016

ELC
A great many devices in the embedded Linux world run vendor-supplied kernels. These kernels are not necessarily dangerous simply because they contain code not found in the mainline kernel releases, but they are still a cause for concern. They introduce a maintenance burden at the very least, as users (either end users or downstream commercial partners) must work to apply out-of-tree patches when they migrate to newer kernel releases. At worst, the vendor's out-of-tree patches can include code that introduces security worries. At the 2016 Embedded Linux Conference in San Diego, Hisao Munakata presented a recent effort he led to systematically measure and assess the kernels that ship in vendor board support packages (BSPs).

The full title of Munakata's talk was "Digitization of kernel diversion from the upstream to minimize local code modifications." As the latter half of that title suggests, Munakata believes that measuring the number and riskiness of out-of-tree patches is only the first step; SoC vendors and board manufacturers can and should seek to reduce their kernels' divergence from mainline over time. He works for the embedded Linux device vendor Renesas and tested the measurement approach against the company's own R-Car boards, which are designed for use in automotive projects. Most of the problems that embedded developers encounter come from working with BSPs, he said, so assessing a BSP's "health" is a task all vendors would do well to consider.

[Hisao Munakata]

Unlike those working on desktop and server Linux projects, Munakata said, embedded developers do not have a turnkey Linux distribution that they can rely on to provide hardware support for the boards or system-on-chips (SoCs) they use. Instead, they almost always rely on the board or SoC vendor's BSP and its included kernel, which they combine in-house with other components.

Vendors pick "random kernels" to ship, he said, while the board is in development. The Long-Term Support Initiative (LTSI) has helped standardize such selections to a degree, but not entirely. Because BSP kernels are generally developed by the vendor while the product is also in development, the hardware is often revised several times prior to release. Thus, the drivers and other patches are often written and tested rapidly and may include patches that are little more than "quick and dirty hacks." Those patches may work for the board in question but break for other use cases, so they will never get merged—assuming they are ever sent upstream.

For users, these BSP kernels cause two main problems: difficulty when migrating to a new kernel release and difficulty applying security patches. Migration to a new release is sometimes required, even when it is arduous, to make use of a significant new feature or API. Vendor patches that touch the kernel outside of device drivers make this process more complicated. The security-patch issue is similar, except that a security fix backported to an older kernel release also adds a sense of urgency.

Considering these issues, Munakata said, he decided it would be useful to have a "sanity assessment" check for BSP kernels. Ideally, he said, a vendor kernel would include a human-readable "certification of contents" file (in its board's bill of materials or BOM) that captured where the kernel differs from the mainline kernel. The file would describe the purpose of each patch, the size and location of each patch in the tree, and some sort of qualitative measurement of each patch's riskiness. He proposed a simple, three-class system for categorizing riskiness: "clean" for patches that merely enable support for new hardware features or backport a relatively self-contained feature, "safe" for patches that implement minor fixes, and "dirty" for patches that either rewrite or outright break functionality from the mainline kernel.

Considering the size of contemporary kernels, he said, such an assessment has to be generated programmatically. Initially, he tried counting the matches and mismatches of SHA-256 file hashes, using the Yaminabe tool. That method provided insufficient detail, however, because trivial and major changes both trigger a hash mismatch. He also tried tracing patches by Git commit IDs, which he said was better at determining how many in-house patch sets have been applied to a kernel, but still fell far short of the goal, since it does not provide any way to measure the riskiness of a particular patch. Naturally, that approach is also limited to those vendors that manage their kernel patches in Git.

The solution he finally settled on was combining Yaminabe's simple hash-mismatch hits with a second scan using a locality-sensitive hash: a hash function that produces similar hashes for similar files and increasingly divergent hashes the more the compared files differ. The Trend Micro Locality Sensitive Hash (TLSH) is one such hash, available in open-source form. The result, built in collaboration with Armijn Hemel, is Yaminabe2. It can compare two Git trees, first weeding out identical files by comparing SHA-256 hashes; for the mismatches, it computes TLSH hashes and reports a "distance" score for each pair.
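
Conceptually, the two-stage scan is simple. The sketch below (ordinary C, not the real Yaminabe2, which is written in Python) shows the idea for a single pair of files: OpenSSL's SHA256() provides the exact hash, and lsh_distance() is a deliberately trivial stand-in for TLSH, which is far more sophisticated. Build it with -lcrypto.

    /* Two-stage comparison sketch: identical files are weeded out with
     * exact SHA-256 hashes; only mismatches get a locality-sensitive
     * distance.  lsh_distance() is a toy stand-in for TLSH. */
    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    static int lsh_distance(const unsigned char *a, size_t alen,
                            const unsigned char *b, size_t blen)
    {
        size_t n = alen < blen ? alen : blen, i;
        int d = (int)(alen > blen ? alen - blen : blen - alen);

        for (i = 0; i < n; i++)
            if (a[i] != b[i])
                d++;
        return d;
    }

    static int compare_files(const unsigned char *a, size_t alen,
                             const unsigned char *b, size_t blen)
    {
        unsigned char ha[SHA256_DIGEST_LENGTH], hb[SHA256_DIGEST_LENGTH];

        SHA256(a, alen, ha);
        SHA256(b, blen, hb);
        if (!memcmp(ha, hb, sizeof(ha)))
            return 0;                /* identical files: distance zero */
        return lsh_distance(a, alen, b, blen);
    }

    int main(void)
    {
        const unsigned char v1[] = "static int foo(void) { return 0; }";
        const unsigned char v2[] = "static int foo(void) { return 1; }";

        printf("distance: %d\n",
               compare_files(v1, sizeof(v1) - 1, v2, sizeof(v2) - 1));
        return 0;
    }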

Transforming those distance scores into the "clean," "safe," and "dirty" assessments took a bit of investigative work. Running Yaminabe2 against various revisions of the R-Car BSP, the distance scores reported ranged from less than 10 (for trivial differences) up to nearly 400. After several rounds of testing, he settled on some cut-off points: a distance below 60 can be marked as "clean," a distance between 60 and 150 as "safe," and anything above 150 likely indicates that a patched file is "dirty." Naturally, such cut-off points can produce false positives, and any such assessment also needs to take a file's location in the kernel source tree into account (e.g., whether or not it is a device driver). But they serve as a good starting point for further exploration.
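
Those cut-offs translate directly into a classification routine; the thresholds below are the ones quoted in the talk, and a real deployment would presumably tune them and weight a file's location in the tree.

    /* Map a per-file distance score onto the three risk classes using
     * the cut-offs quoted in the talk, and sum per-file scores into a
     * whole-tree total the way the Yaminabe2 report does. */
    const char *classify(int distance)
    {
        if (distance < 60)
            return "clean";
        if (distance <= 150)
            return "safe";
        return "dirty";
    }

    long total_distance(const int *scores, int count)
    {
        long total = 0;
        int i;

        for (i = 0; i < count; i++)
            total += scores[i];
        return total;
    }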

Munakata showed sample output from his R-Car BSP tests. For the sake of speed, Yaminabe2 uses a pre-computed database of SHA-256 and TLSH hash scores for the comparison kernel (that is, the mainline) and only processes the "target" Git tree for each run. The terminal version of the comparison tool reports the total number of mismatched files, then lists each file and its distance score, followed by the total accumulated distance for the entire tree. For the R-Car second-generation board, based on the 3.10 LTSI kernel, the first BSP release tested (version 1.1.0) had a total distance of 63,616, while the last (version 1.9.6) had a total distance of 72,242. That increase reflects the growing number of bug fixes applied. Over time, he said, he hopes to track such total distance scores as an aid in decreasing the R-Car BSP's divergence from mainline.

For the third-generation board, which was released in December 2015, the team is currently working to keep pace with the current kernel, he said, because the company wants to settle on the 2017 LTSI kernel for the board's long-term support. Interestingly enough, he added, the total distance reported by Yaminabe2 is presently higher for the R-Car third generation, which he finds perplexing. But there are still many opportunities to tune Yaminabe2: it could be taught to ignore files for irrelevant architectures or for features disabled by the BSP kernel configuration, for example. Perhaps one of those factors accounts for the unexpected results for the new R-Car BSP.

The biggest improvement, he said, would be for other vendors to test Yaminabe2 against their own BSPs, and for companies buying boards or SoCs to run tests against the BSPs provided by their vendors. This would be particularly useful as a way to further refine the risk-scoring system, which at present is quite simple. Munakata added that he also hopes to improve Yaminabe2's reporting features, and to simplify the setup process. At the moment, each Yaminabe2 installation requires a fair amount of customization to tune the TLSH parameters to fit the hardware available (since TLSH is quite CPU-intensive; a full scan can take 12 or more hours).

The code is currently available only as a tar archive on the Yaminabe2 eLinux wiki page, while TLSH must be retrieved from that project's GitHub page. Yaminabe2 is a Python application, and the bundle includes everything users will need to scan and compare any two arbitrary Git trees. But Munakata advised interested parties to start with the (multi-gigabyte) pre-built SHA-256 and TLSH databases also linked to from the wiki page, since scanning and hashing the mainline kernel is a multi-hour operation even on a fast machine.

There are many ways one could gauge the riskiness of a vendor-supplied kernel. Munakata and Hemel's work on Yaminabe2 offers just one approach, and it certainly needs more widespread testing. But it may prove to be a good start toward solving an under-addressed problem: helping embedded users get a handle on precisely where and how their BSP kernel diverges from mainline.

[The author would like to thank the Linux Foundation for travel assistance to attend ELC 2016.]

Improvements in CPU frequency management

April 6, 2016

This article was contributed by Neil Brown

A few years ago, Linux scheduler maintainer Ingo Molnar expressed a strong desire that future improvements in CPU power management should integrate well with the scheduler and not try to work independently. Since then, there have been improvements to the way the scheduler estimates load — per-entity load tracking in particular — and some patch sets have circulated that aim to feed these improved estimates into CPU frequency scaling, but the recently released 4.5 kernel still does CPU power management much as it has for years. That appears to be about to change, with some preliminary work already merged for 4.6-rc1 and a more significant improvement, the schedutil cpufreq governor, being prepared by power-management maintainer Rafael Wysocki for likely inclusion in 4.7.

The focus of these patch sets is just to begin the process of integration. They don't present a complete solution, but only a couple of steps in the right direction, by creating interfaces for linking the scheduler more closely with CPU power management.

CPU power management in Linux 4.5

Linux support for power management of CPUs roughly divides into cpuidle, which guides power usage when the CPU is idle, and cpufreq, which governs power usage when it is active. cpufreq drivers divide into those, like intel_pstate and the Transmeta longrun, that choose the CPU frequency themselves using hardware-specific performance counters or firmware functionality, and those that need to be told which power level, from a given range of options, is appropriate at any given time. This power level is nominally a CPU frequency but, as it can be implemented by scaling a high-frequency clock, it is sometimes referred to as a frequency scaling factor. It is not uncommon to scale down the voltage together with the frequency, so you will also see terms like DVFS for "Dynamic Voltage and Frequency Scaling", or the more generic OPP for "Operating Performance Point". For the most part, we will just refer to frequency changes here.

The cpufreq governors that choose a frequency for the second group of drivers are also divided into two groups: the trivial and the non-trivial. There are three trivial frequency governors: performance, which always chooses the highest frequency, powersave, which always chooses the lowest, and userspace, which allows a suitably privileged user-space process to make the decision by writing to a file in sysfs. The non-trivial governors vary the CPU frequency based on the apparent system load. This is quite similar to the approach used by intel_pstate, except that they determine load from generic information known to the kernel, rather than hardware-specific counters.
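
For a concrete sense of just how trivial the trivial governors are from the outside: the userspace governor amounts to a sysfs file that a privileged process writes a frequency into. The sketch below uses the standard cpufreq sysfs paths for cpu0 and assumes the userspace governor is available on the system; the 800000 kHz value is only an example.

    /* Select the userspace governor on cpu0 and request a frequency
     * (in kHz).  Needs root; paths are the standard cpufreq sysfs
     * files and assume the userspace governor is built in. */
    #include <stdio.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        const char *base = "/sys/devices/system/cpu/cpu0/cpufreq/";
        char path[128];

        snprintf(path, sizeof(path), "%sscaling_governor", base);
        if (write_str(path, "userspace"))
            perror("scaling_governor");

        snprintf(path, sizeof(path), "%sscaling_setspeed", base);
        if (write_str(path, "800000"))    /* 800 MHz, for example */
            perror("scaling_setspeed");
        return 0;
    }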

The statistics used by these governors do come from the scheduler, but only in a roundabout way that does not reflect the sort of integration that Molnar hopes for. The governors set a timer with a delay of some tens or hundreds of milliseconds and use the CPU-usage statistics (visible in /proc/stat) to determine the average load over that time, which is the ratio of the CPU's busy time to total time. There is some subtlety to exactly how this number is used but, roughly, the CPU frequency is increased when the load is above a threshold and decreased when it falls below another. The two non-trivial frequency governors differ in their response to increasing load: ondemand will immediately jump to the highest frequency and then possibly back off as the idle time increases, making it suitable for interactive tasks, while the conservative governor scales up more gradually, as is fitting for background jobs.
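
The load estimate itself is nothing exotic: sample the first line of /proc/stat, wait, sample again, and compare busy time to total time. The userspace sketch below mimics that calculation; the one-second interval and the 80%/30% thresholds are arbitrary illustrations rather than the governors' actual tunables.

    /* Estimate CPU load roughly the way the load-based governors do:
     * sample /proc/stat, wait, sample again, and compare busy time to
     * total time.  Interval and thresholds are for illustration only. */
    #include <stdio.h>
    #include <unistd.h>

    static int read_cpu(long long *busy, long long *total)
    {
        long long v[10] = {0};
        FILE *f = fopen("/proc/stat", "r");
        int i, n;

        if (!f)
            return -1;
        n = fscanf(f, "cpu %lld %lld %lld %lld %lld %lld %lld %lld %lld %lld",
                   &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8], &v[9]);
        fclose(f);
        if (n < 5)
            return -1;
        *total = 0;
        for (i = 0; i < 10; i++)
            *total += v[i];
        *busy = *total - v[3] - v[4];    /* subtract idle and iowait */
        return 0;
    }

    int main(void)
    {
        long long b1, t1, b2, t2;
        double load;

        if (read_cpu(&b1, &t1))
            return 1;
        sleep(1);                   /* the governors use tens of ms */
        if (read_cpu(&b2, &t2))
            return 1;
        load = 100.0 * (b2 - b1) / (double)(t2 - t1);
        printf("load %.1f%%: %s\n", load,
               load > 80 ? "raise frequency" :
               load < 30 ? "lower frequency" : "hold");
        return 0;
    }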

Problems with the current approach

There are various reasons for dissatisfaction with the current governors. Probably the easiest to identify is the one that caused the Android developers to write their own governor: interactive responsiveness. An idle CPU will likely be at the lowest frequency setting. When a user starts interacting with the device, the CPU will continue at that setting for many milliseconds until the timer fires and the new load is assessed. While the on-demand governor does go straight to the maximum frequency, it doesn't do so immediately, and the delay is both noticeable and undesirable.

The interactive governor for Android is designed to detect this transition out of idle and to set a much shorter timeout, typically under ten milliseconds ("1-2 ticks"). A new frequency is then chosen based on the load over this short time range rather than the normal longer interval. The result is improved interactive response without the loss of stability that is likely if all samples were over short time periods.

Another reason for dissatisfaction is that resetting those timers frequently on every CPU is far from elegant. Thomas Gleixner, maintainer of the timer code, is known not to like them and Wysocki noted that "getting rid of those timers allows quite some irritating bugs in cpufreq to be fixed".

Finally, the information used to guide frequency choice is based entirely on recent history. While it is hard to get reliable information about the future, there is information about the present that can be useful. To understand this it is helpful to consider the different classes of threads as they are seen by the scheduler, which can be divided into realtime, deadline, or normal.

Of these, threads configured for deadline scheduling provide the most information about the future. Such threads must specify a worst-case CPU time needed to achieve their goal and a period indicating how often it will be needed. From this, it is possible to calculate the minimum CPU frequency that will allow all deadline threads to be serviced. This can be done without any reference to history.
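
The arithmetic is simple enough to show directly: each deadline thread needs its runtime out of every period, so the required fraction of CPU capacity is the sum of those ratios, and multiplying by the maximum frequency gives the floor. The thread parameters and the 2 GHz maximum below are invented for illustration.

    /* Minimum frequency implied by a set of deadline threads: each
     * needs runtime_i out of every period_i, so the required fraction
     * of CPU capacity is the sum of runtime_i / period_i.  All values
     * here are made up for illustration. */
    #include <stdio.h>

    struct dl_task { double runtime_us, period_us; };

    int main(void)
    {
        struct dl_task tasks[] = {
            { 3000, 10000 },     /* 3 ms every 10 ms -> 0.30 */
            { 1000, 20000 },     /* 1 ms every 20 ms -> 0.05 */
        };
        double max_khz = 2000000, util = 0;
        unsigned int i;

        for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
            util += tasks[i].runtime_us / tasks[i].period_us;

        printf("required utilization: %.2f -> minimum %.0f kHz\n",
               util, util * max_khz);
        return 0;
    }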

Realtime threads also provide information about the future, though it is not so detailed. When a realtime thread is ready to run, "the only possible choice the kernel has", as Peter Zijlstra put it, "is max OPP". When there is a realtime thread, but it is currently waiting for something, there are two options. If switching to top speed can be done quickly, then it is safe to ignore the thread until it is runnable, and then instantly switch to maximum frequency as it starts running. If switching the CPU frequency takes a bit longer, then the only really safe thing to do is to stay at maximum CPU frequency whenever there are any realtime threads on that CPU.

Lastly, there are the normal threads, those managed by the Completely Fair Scheduler (CFS). While the primary source of information available to CFS is historical, it is more fine-grained than the information currently in use. Using per-entity load tracking, CFS knows the recent load characteristics of every thread on each CPU and knows immediately if a thread has exited, a new one has been created, or if a thread has migrated between CPUs. This allows a more precise determination of load, which is already being used for scheduling decisions and could usefully be used for CPU frequency scaling decisions.

CFS can have a little bit more information than historical usage. It is possible for a maximum CPU bandwidth to be imposed on a process or process group. CFS could use this to determine an upper limit for the load generated by those processes and may usefully be able to provide that information to cpufreq.

Challenges

Even if all this information were readily available, and some of it is, making use of it effectively is not necessarily straightforward. This particularly seems to be the case when working in and around the scheduler, as that code is quite performance-sensitive.

One challenge that cpufreq must face is how exactly to change the CPU frequency once a decision has been made. As was hinted at above, some platforms may allow fast frequency changes but, while that is true, cpufreq doesn't know anything about it. A cpufreq driver has a single interface, target_index(), to set the frequency, which may take locks or block for other reasons, so it cannot be depended on to be fast. An optional non-blocking interface has been proposed, but it will still be necessary to work with cpufreq drivers that need to be called from "process context". This currently means scheduling a worker thread to effect the frequency change.
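
For reference, that callback in struct cpufreq_driver has roughly the following shape; the details beyond the prototype vary per driver.

    int (*target_index)(struct cpufreq_policy *policy,
                        unsigned int index);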

There is a question whether a regular workqueue thread is really sufficient. It seems possible that on a busy system, such a thread might not get scheduled for a while, so at those times, when a switch to high frequency is most important, it will be most delayed. The interactive governor used by Android has addressed this issue by creating a realtime thread to perform all frequency changes. While this works, it does not seem like an ideal solution for the longer term. As noted above, it may be sensible to keep the CPU at maximum speed whenever there are realtime threads. In that case, the existence of a realtime thread for setting the frequency would prevent it from ever being asked to set a lower frequency.

This is currently an open issue. Wysocki is not convinced that this is a real problem in practice, so he is not keen on anything beyond the current simple approach; those who feel otherwise will need to provide concrete evidence of a problem, which is a reasonable prerequisite for such changes.

Two steps forward

There are two patch sets that have been prepared by Wysocki and look likely to reach the mainline soon; one has already been included in 4.6-rc1. There are a number of details in which they differ from others that have been proposed, such as the scheduler-driven frequency selection originally proposed by Michael Turquette, but possibly the most important is that they are incremental steps with modest goals. They address the issues that can easily be addressed and leave other more difficult issues for further research.

The first of these patch sets removed the timers that Gleixner found so distasteful. Instead of being triggered by a timeout, re-evaluation of the optimal setting for cpufreq is now triggered by the scheduler whenever it updates load or runtime statistics. This will doubtless be called more often than frequency changes are really wanted, and possibly more often than it is possible to update the frequency choice, so the sampling_rate number that previously set the timer is now used to discard samples until an appropriate interval since the last update has passed.
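
In effect, the scheduler hook degenerates into a rate limiter: it is called very frequently, but most calls return immediately because not enough time has passed since the last re-evaluation. A simplified model of that logic (an illustration only, not the kernel's code) looks like this:

    /* Simplified model of the rate limiting described above: the
     * scheduler calls in very often, but the frequency is only
     * re-evaluated once sampling_rate has passed. */
    #include <stdio.h>

    static unsigned long long last_update_us;
    static const unsigned long long sampling_rate_us = 10000;  /* 10 ms */

    static void reevaluate_frequency(unsigned long util, unsigned long max)
    {
        printf("re-evaluating: util %lu / %lu\n", util, max);  /* stub */
    }

    void cpufreq_update_hook(unsigned long long now_us,
                             unsigned long util, unsigned long max)
    {
        if (now_us - last_update_us < sampling_rate_us)
            return;                  /* too soon; drop this sample */
        last_update_us = now_us;
        reevaluate_frequency(util, max);
    }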

The new frequency choice is still calculated the same way and it happens about as often, but one noteworthy change is that the updates are no longer synchronized with the scheduler "tick" that timers use. This change has resulted in one measurable benefit for the intel_pstate driver, which occasionally made poor decisions due to this synchronization.

The second patch set, some of which is up to its eighth revision, makes two particular changes. A new, optional fast_switch() interface is added to cpufreq drivers so that fast frequency switching can be used on platforms that support it. If provided, this must be able to run from "interrupt context" meaning that it must be fast and may not sleep. As already discussed, there are times when this can be quite valuable.
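
A driver that supports fast switching would provide a callback along these lines; the exact signature could still change before the patches are merged.

    unsigned int (*fast_switch)(struct cpufreq_policy *policy,
                                unsigned int target_freq);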

The other important change is to introduce a new CPU frequency-scaling governor, schedutil, that makes decisions based on utilization as measured by the scheduler. The cpufreq_update_util() call that the scheduler makes whenever it updates the load average already carries information about the calculated load on the current CPU, but no governor uses that information. schedutil changes that. It doesn't change much though.

schedutil still only performs updates at the same rate as the current code, so it doesn't try to address the interactive responsiveness problem, and doesn't try to be clever about realtime or deadline threads. All it does is use the load calculated by the scheduler instead of the average load over the last little while, and optionally imposes that frequency change instantly (directly from the scheduler callback) if the driver supports it.
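
Stripped of the details, the frequency choice amounts to scaling the maximum frequency by the ratio of the reported utilization to the maximum possible utilization. The function below illustrates only that core proportionality; the real governor also has to round to a frequency the hardware actually offers and may apply some headroom.

    /* Illustration of the schedutil idea: choose a frequency
     * proportional to the scheduler-reported utilization.  This is
     * the core proportionality only, not the kernel's code. */
    unsigned long next_freq(unsigned long util, unsigned long max_util,
                            unsigned long max_freq_khz)
    {
        if (util >= max_util)
            return max_freq_khz;
        return max_freq_khz * util / max_util;
    }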

This is far from a complete solution for power-aware scheduling, but looks like an excellent base on which to make cpufreq more responsive to sudden changes in load, and more aware of some of the finer details that the scheduler can, in theory, provide.

It appears that the long-term goal is to get rid of the selectable governors completely and just have a single governor that handles all cases correctly. It would need to respond correctly to realtime tasks, deadline tasks, interactive tasks, and background tasks. These are all concepts that the scheduler must already deal with, so it is quite reasonable to expect that cpufreq can learn to deal with them too. It will clearly take a while longer to reach the situation that Molnar desires, but it seems we are well on the way.

Early packet drop — and more — with BPF

By Jonathan Corbet
April 6, 2016
The Berkeley packet filter (BPF) mechanism has been working its way into various kernel subsystems since it was rewritten and extended in 2014. There is, it turns out, great value in an in-kernel virtual machine that allows for the implementation of arbitrary policies without writing kernel code. A recent patch set pushing BPF into networking drivers shows some of the potential of this mechanism — and the difficulty of designing its integration in a way that will stand the test of time. If it is successful, it may change the way high-performance networking is done on Linux systems.

Early drop

This patch set from Brenden Blanco is, in one sense, a return to the original purpose of BPF: selecting packets for either acceptance or rejection. In this case, though, that selection is done at the earliest possible moment: in the network-adapter device driver, as soon as the packet is received. The intent is to make the handling of packets destined to be dropped as inexpensive as possible, preferably before doing any protocol-processing work, such as setting up a sk_buff structure (SKB), for those packets.

BPF programs, as loaded by the bpf() system call, have a type associated with them; that type is checked before a program can be loaded for a specific task. Brenden's patch set starts by defining a new type, BPF_PROG_TYPE_PHYS_DEV, for programs that will do early packet processing. Each program type includes a "context" for information that is made available when the program runs; in this case, the context needs to include information about the packet under consideration. Internally, that context is represented by struct xdp_metadata; it contains only the length of the packet in this version of the patch set.
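
As a rough sketch, a program of this type that drops oversized packets could look like the following. The context layout follows the description above (just a length, in this version of the patch set); the field name is an illustrative guess rather than the patch set's actual definition, and the non-zero-means-drop convention is described below.

    /* Sketch of an early-drop program for the proposed
     * BPF_PROG_TYPE_PHYS_DEV type.  The context holds only the packet
     * length in this version of the patches; the field name here is
     * illustrative. */
    #include <linux/types.h>

    struct xdp_metadata {
        __u32 length;
    };

    /* A non-zero return asks the driver to drop the packet. */
    int drop_oversized(struct xdp_metadata *ctx)
    {
        return ctx->length > 1500;
    }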

The next step is to add a new net_device_ops method that drivers can supply:

	int (*ndo_bpf_set)(struct net_device *dev, int fd);

A call to ndo_bpf_set() tells the driver to install the BPF program indicated by the provided file descriptor fd; the program should replace the existing program, if any. A negative fd value means that any existing program should be removed. There is a new netlink operation allowing user space to set a program on a given network device.

The driver can use bpf_prog_get() to get a pointer to the actual BPF program from the file descriptor. When a packet comes in, the BPF_PROG_RUN() macro can be used to run the program on the packet; a non-zero return code from the program indicates that the packet should be dropped.
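
Put together, a driver might use these interfaces roughly as follows. Only bpf_prog_get(), bpf_prog_put(), and BPF_PROG_RUN() are existing kernel interfaces here; the private-data structure and surrounding details are invented for illustration, locking and RCU are omitted, and (as discussed below) real drivers do not pass a full SKB.

    /* Hypothetical driver-side sketch; mydrv_priv and the surrounding
     * details are invented, and synchronization is omitted. */
    static int mydrv_bpf_set(struct net_device *dev, int fd)
    {
        struct mydrv_priv *priv = netdev_priv(dev);
        struct bpf_prog *old = priv->prog;

        if (fd >= 0) {
            struct bpf_prog *prog = bpf_prog_get(fd);

            if (IS_ERR(prog))
                return PTR_ERR(prog);
            priv->prog = prog;
        } else {
            priv->prog = NULL;   /* a negative fd removes the program */
        }
        if (old)
            bpf_prog_put(old);
        return 0;
    }

    /* In the receive path, before any protocol processing: */
    if (priv->prog && BPF_PROG_RUN(priv->prog, skb))
        goto drop;               /* non-zero return: drop the packet */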

Just a starting point

The interface for the running of the BPF program is where the disagreement starts. The driver must clearly give information about the new packet to the program being run; that is done by passing an SKB pointer to BPF_PROG_RUN(). The internal machinery hides the creation of the xdp_metadata information from the passed-in SKB. The mechanism seems straightforward enough, and it takes advantage of the existing BPF functionality for working with SKBs, but there are a couple of objections. The first of those is that the whole point of the early-drop mechanism is to avoid the overhead of packet processing on packets that will be dropped anyway; the initial, and not insignificant, part of that overhead is the creation of the SKB structure. Creating it anyway would appear to be defeating the purpose.

In truth, the one driver (mlx4) that has been modified to implement this mechanism doesn't create a full SKB; instead, it puts the minimal amount of information into a fake, statically allocated SKB. That avoids the overhead, but at the cost of creating an SKB that isn't really an SKB. The amount of information that needs to go into this fake SKB will surely grow over time — there is surprisingly little call for the ability to drop packets using their length as the sole criterion. Whenever new information is needed, every driver will have to be tweaked to provide it, and, over time, the result will look increasingly like a real SKB with the associated overhead.

The other potential problem is that there is a fair amount of interest in eventually pushing the BPF programs (possibly after a translation pass) into the network adapter itself. That would allow packets to be dropped before they come to the kernel's attention at all, optimizing the process further. But the hardware is not going to have any knowledge of the kernel's SKB structure; all it can see is what is in the packet itself. If BPF programs are written to expect an SKB to be present, they will not work when pushed into the hardware.

There is a larger issue, though: quickly dropping packets is a nice capability, but high-performance networking users want to do more than that. They would like to be able to load BPF programs to do fast routing, rewrite packet contents at ingress time, perform decapsulation, coalesce large packets, and so on. Indeed, there is a whole vision for the "express data path" (or "XDP") built around low-level BPF packet processing; see these slides [PDF] for an overview of what the developers have in mind. In short, they want to provide the sort of optimized processing performance that attracts users to user-space networking stacks while retaining the in-kernel stack and all its functionality.

If the mechanism is to be extended beyond drop/accept decisions, the information and functionality available to BPF programs will clearly have to increase, preferably without breaking any existing users. As Alexei Starovoitov put it: "We have to plan the whole project, so we can incrementally add features without breaking abi". The current patch set does not reflect much planning of this type; it is, instead, a request-for-comments posting introducing the mechanism that the XDP developers want to build on.

So, clearly, this code will not be going into the mainline in its current form. But it has had the desired effect of getting the conversation started; there is, it would seem, a lot of interest in adding this feature. If the XDP approach is able to achieve its performance and functionality goals, it should give user-space stacks a run for their money. But there is some significant work to be done to get to that point.

Patches and updates

Kernel trees

Linus Torvalds Linux 4.6-rc2
Sebastian Andrzej Siewior v4.4.6-rt13
Kamal Mostafa Linux 4.2.8-ckt7
Kamal Mostafa Linux 3.19.8-ckt18
Ben Hutchings Linux 3.2.79

Architecture-specific

Core kernel code

Device drivers

Device driver infrastructure

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds