
Kernel development

Brief items

Kernel release status

The current development kernel is 4.1-rc7, released on June 7. "Normally rc7 tends to be the last rc release, and there's not a lot going on to really merit anything else this time around. However, we do still have some pending regressions, and as mentioned last week I also have my yearly family vacation coming up, so we'll have an rc8 and an extra week before 4.1 actually gets released."

Stable updates: 4.0.5, 3.14.44, and 3.10.80 were released on June 6.


Quotes of the week

And naming matters: a good name is both descriptive and short, as we know it from the excellent examples of 'Linux' and 'Git'. Oh wait ...
Ingo Molnar

Why not go the full monty and call it strtrtsstrrrtst(), with strtrtssstrrtst() doing almost, but not quite the same thing?
Al Viro


Huston: Multipath TCP

Geoff Huston has written a lengthy column on multipath TCP. "For many scenarios there is little value in being able to use multiple addresses. The conventional behavior is where each new session is directed to a particular interface, and the session is given an outbound address as determined by local policies. However, when we start to consider applications where the binding of location and identity is more fluid, and where network connections are transient, and the cost and capacity of connections differ, as is often the case in today's mobile cellular radio services and in WiFi roaming services, then having a session that has a certain amount of agility to switch across networks can be a significant factor." (See also: LWN's look at the Linux multipath TCP implementation from 2013).


Kernel development news

The difficult task of doing nothing

By Jonathan Corbet
June 9, 2015

LinuxCon Japan
Kristen Accardi started her LinuxCon Japan session with the claim that idle is the most important workload on most client systems. Computers placed in offices are busy less than 25% of the time; consumer systems have even less to do. So idle performance, especially with regard to power consumption, is important. The good news is that hardware engineers have been putting a lot of work into reducing the power consumption of idle systems; the bad news is that operating systems are often failing to take full advantage of that work.

In the "good old days," Kristen said, power management was relatively easy — and relatively ineffective. The "Advanced Power Management" (APM) mechanism was entirely controlled by the BIOS, so operating systems didn't have to pay much attention to it. Intel's "SpeedStep" offered exactly one step of CPU frequency scaling. The operating system could concern itself with panel dimming on laptop systems. That was about the extent of the power-management capabilities provided by the hardware at that time.

With the rise of the mobile market, though, power management started to get more complicated. ACPI was introduced, putting more power-management work into the operating system's domain. With ACPI came the notion of "S-states" (for system-wide power-management states), "C-states" (for CPU idle states), and "P-states" (for performance-level states — frequency and voltage scaling). There can be up to 25 of these states.

But things do not stop there; in recent years there has been an explosion of power-management features. They have names like S0ix (a new low-power state) and PSR ("panel self refresh"). All of these features must be understood by the operating system, and all must work together for them to be effective.

Degrees of idleness

There are, Kristen said, three fundamental degrees of idleness in a system, differing in the amount of power they use and the time it takes to get back to an active state. The level with the lowest power consumption is "off." That is an increasingly uninteresting state, though; many consumer devices no longer have an "off" switch at all. Operating system support for the "off" state tends to be easy, so there wasn't much to talk about there.

The other two states are "suspend" and "runtime idle". A suspended system is in an intermediate state between running and off; runtime idle is closer to a running system, with nearly instant response when needed. Kernel support for the two states differs in a number of ways. Suspend is a system-wide state initiated by the user, while runtime idle is a device-specific state that happens opportunistically. In a suspended system, all tasks are frozen and all devices are forced into the idle state; with runtime idle, instead, tasks are still scheduled and devices may be active. Suspend can happen at any time, while runtime idle only comes into play when a device is idle.

Device drivers must support these two states separately; it is more work, but it's important to do. But platform-level support is also important. In current times, everything is a system-on-chip (SoC) with a great many interactions between components. If one of those components is active, it can keep the entire system out of the lower-power states.
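
As a rough illustration of what that per-device support looks like, a driver typically provides separate system-sleep and runtime callbacks in its struct dev_pm_ops. The skeleton below is a minimal sketch, not code from the talk; all of the my_dev_* names are hypothetical:

    /* Minimal sketch of a driver that handles both system suspend and
     * runtime idle; all my_dev_* names are hypothetical. */
    #include <linux/device.h>
    #include <linux/module.h>
    #include <linux/platform_device.h>
    #include <linux/pm.h>
    #include <linux/pm_runtime.h>

    static int my_dev_suspend(struct device *dev)
    {
            /* System-wide suspend: tasks are frozen, force the device idle. */
            return 0;
    }

    static int my_dev_resume(struct device *dev)
    {
            return 0;
    }

    static int my_dev_runtime_suspend(struct device *dev)
    {
            /* Opportunistic idle: the rest of the system keeps running. */
            return 0;
    }

    static int my_dev_runtime_resume(struct device *dev)
    {
            return 0;
    }

    static const struct dev_pm_ops my_dev_pm_ops = {
            SET_SYSTEM_SLEEP_PM_OPS(my_dev_suspend, my_dev_resume)
            SET_RUNTIME_PM_OPS(my_dev_runtime_suspend, my_dev_runtime_resume, NULL)
    };

    static struct platform_driver my_dev_driver = {
            .driver = {
                    .name = "my_dev",
                    .pm   = &my_dev_pm_ops,
            },
    };
    module_platform_driver(my_dev_driver);
    MODULE_LICENSE("GPL");

The runtime callbacks only come into play once the driver has enabled runtime power management for the device, typically with pm_runtime_enable() in its probe routine.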

To see how that can come to pass, consider the "latency tolerance reporting" (LTR) mechanism built into modern buses. Any device on the bus can indicate that it may need the CPU's attention within a given maximum time (the latency tolerance). The CPU, in turn, maintains a table describing the amount of time required to return to active operation from each of its idle states. When the CPU is ready to go into a low-power state, the latency requirements declared by active devices will be compared against that table to determine the lowest state that the CPU can go into. So, if a device is running and declaring a tight latency tolerance, it can prevent the CPU from entering a deep-idle state.
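
In other words, the choice of idle state reduces to picking the deepest state whose exit latency still fits within the tightest tolerance any active device has declared. Here is a small, self-contained sketch of that comparison; the state names and latency numbers are invented for illustration:

    /* Toy model of latency-tolerance-based idle-state selection; the
     * state names and exit latencies are invented for illustration. */
    #include <stdio.h>

    struct idle_state {
            const char *name;
            unsigned int exit_latency_us;  /* time to become active again */
    };

    /* Deeper states save more power but take longer to exit. */
    static const struct idle_state states[] = {
            { "C1", 2 }, { "C3", 80 }, { "C6", 200 }, { "C10", 1000 },
    };

    /* Pick the deepest state whose exit latency fits within the tightest
     * tolerance reported by any active device; fall back to the
     * shallowest state if nothing fits. */
    static const struct idle_state *pick_state(unsigned int min_tolerance_us)
    {
            const struct idle_state *best = &states[0];
            unsigned int i;

            for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
                    if (states[i].exit_latency_us <= min_tolerance_us)
                            best = &states[i];
            return best;
    }

    int main(void)
    {
            /* A device declaring a 100us tolerance limits the CPU to "C3". */
            printf("tolerance 100us  -> %s\n", pick_state(100)->name);
            printf("tolerance 5000us -> %s\n", pick_state(5000)->name);
            return 0;
    }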

Where the trouble lies

Kristen then gave a tour of the parts of the system that are, in her experience, particularly likely to trip things up. At the top of the list were graphics processors (GPUs); these are complex devices and it tends to take quite a while to get power management working properly on them. The "RC6" mechanism describes a set of power states for GPUs; naturally, one wants the GPU to be in a low-power state when it doesn't have much to do. Beyond that, framebuffer compression can reduce memory bandwidth use depending on what's in the framebuffer; sending less video data results in lower power usage. Kristen suggested that users choose a simple (highly compressible) background image on their devices for the best power behavior. "Panel self refresh" allows the CPU to stop sending video data to the screen entirely if things are not changing; it can be inhibited by things like animated images on the screen.

Another "problem child" is audio. On many systems, audio data can be routed through the GPU, preventing it from going into an idle state. Audio devices tend to be complex, consisting of, at a minimum, a controller and a codec; drivers must manage power-management states for both of those devices together.

On the USB side, the USB 3.0 specification added a number of useful power-management features. USB 2.0 had a "selective suspend" feature, but it added a lot of latency, reducing its usefulness. In 3.0, the controller can suspend itself, but only if all connected devices are suspended. The USB "link power management" mechanism can detect low levels of activity and reduce power usage.

There are three power-management technologies available for SATA devices. The link power management mechanism can put devices into a sleep state and, if warranted, turn the bus off entirely. "ZPODD" is power management for optical devices, but Kristen has never seen anybody actually use it; optical devices are, in general, not as prevalent as they once were. The SATA controller itself offers some power-management features, but they tend to be problematic, she said, so they are not much used in Linux.

The PCI Express bus has a number of power-management options, including ASPM for link-level management, RTPM as a runtime power-management feature, and latency tolerance reporting. The I2C bus has fewer features, being a simpler bus, but it is usually possible to power down I2C controllers. Human-input devices, which are often connected via I2C, tend to stay powered up while they are open, which can be a problem for system-wide power management.

And, of course, software activity can keep a system from going into deep idle states. If processes insist on running, the CPU will stay active, leaving suspend as the only viable option for power savings. Even brief bursts of CPU activity can, if they wake the processor from idle frequently, significantly reduce battery life.

Idle together

The conclusion from all of this is that power management requires a coordinated effort. For a system to go into a low-power state, a number of things must happen. User space must be quiet, the platform must support low-power states across all devices, and the kernel must properly support each device's power-management features. The system must also be configured properly; Kristen expressed frustration at mainstream distributions that fail to set up proper power management at installation time, wasting the effort that has been put into power-management support at the lower levels. Getting all of the pieces to work together properly can be a difficult task, but the result — systems that efficiently run our most important workload — is worth the trouble.

[Your editor would like to thank the Linux Foundation for funding his travel to LinuxCon Japan]


Enforcing mount options for sysfs and proc

By Jake Edge
June 10, 2015

The sysfs and proc filesystems do not contain executables, setuid programs, or device nodes, but they are typically mounted with flags (i.e. noexec, nosuid, and nodev) that disallow those types of files anyway. Currently, those flags are not enforced when those filesystems are mounted inside of a namespace (i.e. the mount will succeed without those flags being specified). Furthermore, a sysfs or proc filesystem that has been mounted read-only in the host can be mounted read-write inside a namespace, which is a bigger problem. The others are subtle security holes, or they could be, so Eric Biederman is trying to close them, though it turns out that the fixes break some container code.

In mid-May, Biederman posted a patch set meant to address the problems, which boil down to differences in the behavior of mounting these filesystems inside a container versus bind-mounting them. In the latter case, any restrictions the administrator has placed on the original proc and sysfs mounts will be enforced on the bind mounts. If, instead, those filesystems are directly mounted inside of a user namespace, those restrictions won't be enforced. The problem is largely moot, at least for now, for executables, setuid programs, and device nodes, but that is not so for the read-only case. If the administrator of the host has mounted /proc as read-only, a process running in a user namespace could mount it read-write to evade that restriction, which is clearly a problem.

But Biederman was well aware that he might be breaking user-space applications by making this change. In particular, he was concerned about Sandstorm, LXC, and libvirt LXC, all of which employ user namespaces. So he put out the patches for testing (and comment).

That led to two reports of breakage, the first from Serge Hallyn about a problem he found using the patches with LXC. The LXC user space was not passing the three flags that restrict file types allowed for sysfs, which caused the mount() call to fail with EPERM due to Biederman's changes. The fix for LXC is straightforward but, as Andy Lutomirski pointed out, Biederman's change is an ABI break for the kernel. Given that there aren't executables or device nodes on sysfs or proc, dropping enforcement of those flags from the patch would not have any practical effect, Lutomirski argued.
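
The fix on the LXC side amounts to passing the restrictive flags explicitly when mounting sysfs inside the container. A minimal sketch of the kind of mount() call involved (illustrative only, not LXC's actual code):

    /* Illustrative only: mount sysfs with the same restrictive flags the
     * host's mount carries, so that the mount() is allowed inside a user
     * namespace under Biederman's patches. Not LXC's actual code. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            unsigned long flags = MS_NOSUID | MS_NODEV | MS_NOEXEC;

            /* Without these flags, the mount fails with EPERM in a user
             * namespace once the enforcement patches are applied. */
            if (mount("sysfs", "/sys", "sysfs", flags, NULL) < 0) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }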

Sandstorm lead Kenton Varda suggested that instead of returning EPERM, mount() should instead ignore the lack of those flags when the caller has no choice in the matter:

That is, in cases where mount() currently fails with EPERM when not given, say, MS_NOSUID, it should instead just pretend the caller actually set MS_NOSUID and go ahead with a nosuid mount. Or put another way, the absence of MS_NOSUID should not be interpreted as "remove the nosuid bit" but rather "don't set the nosuid bit if not required".

As Varda noted, that would fix the problem without LXC needing to change its code. He also thought it would be less confusing than getting an EPERM in that situation. Neither Biederman nor Lutomirski liked the implicit behavior that Varda suggested, however.
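
To make the two interpretations concrete, here is a rough user-space model of the difference; it is purely an illustration of the proposed semantics, not kernel code:

    /* Toy model of how a missing MS_NOSUID could be handled when the
     * underlying mount is locked nosuid; purely illustrative. */
    #include <stdio.h>
    #include <sys/mount.h>

    /* Current behavior (simplified): a locked flag the caller did not
     * request makes the mount fail. */
    static int current_semantics(unsigned long *requested, unsigned long locked)
    {
            if (locked & ~*requested)
                    return -1;      /* EPERM */
            return 0;
    }

    /* Varda's suggestion: quietly carry the locked flags over instead
     * of failing. */
    static int proposed_semantics(unsigned long *requested, unsigned long locked)
    {
            *requested |= locked;
            return 0;
    }

    int main(void)
    {
            unsigned long flags = 0;
            unsigned long locked = MS_NOSUID | MS_NODEV | MS_NOEXEC;

            printf("current:  %d\n", current_semantics(&flags, locked));
            printf("proposed: %d (flags now %#lx)\n",
                   proposed_semantics(&flags, locked), flags);
            return 0;
    }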

It turns out that libvirt LXC has a similar problem, as reported by Richard Weinberger. It is mounting /proc/sys, but not preserving the mount flags from /proc in the host, so the mount() call was failing. Once again, there is a simple fix.

Lutomirski suggested removing the noexec/nosuid/nodev part, but keeping the read-only enforcement, to avoid the ABI break. Biederman disagreed with that approach. It may not matter now whether proc and sysfs are mounted with those flags, but it has mattered in the past and could again in the future:

So I am leaning towards enforcing all of the mount flags including nosuid, noexec, and nodev. Then when the next subtle bug in proc or sysfs with respect to chmod shows up I will be able to sleep soundly at night because the mount flags of those filesystems allow a mitigation, and I did not [sabotage] the mitigation.

Plus contemplating code that just enforces a couple of mount flags but not all of [them] feels wrong.

He did want to avoid breaking LXC and libvirt LXC, though, at least until those programs could be fixed and make their way out to users over the next few years. So Biederman added a patch that relaxed the requirement for noexec and nosuid (nodev turns out to be a non-issue due to other kernel changes), but printed a warning in the kernel log. Since it is a security fix (though not currently exploitable), he targeted the stable kernels with the fix too. However, Greg Kroah-Hartman pointed out that adding warnings for things that have been working just fine is not acceptable in stable kernels.

Though others disagree, Biederman does not see his changes as breaking the ABI. They do cause a behavior change, however, and break two user-space programs (at least, those known so far). He would prefer not to break those programs, so the warning is kind of a stop-gap measure, he argued. The changes are fixing security holes, though, even if it appears they are not exploitable right now:

Given that I have not audited sysfs and proc closely in recent years I may actually be wrong. Those bugs may actually be exploitable. All it takes is chmod to be supported on one file that can be made executable. That bug has existed in the past and I don't doubt someone will overlook something and we will see the bug again in the future.

As it stands, the changes will still allow current LXC and libvirt LXC executables to function (though the version targeting the mainline will warn about that kind of use). Biederman plans to get the patches into linux-next, presumably targeting 4.2. After that, he plans to remove the warning and enforce the mount options in a subsequent kernel release. It is a bit hard to argue that either of the two broken programs was actually doing what its authors intended in the mount() calls, even though the calls worked. Assuming no other breakage appears, that might be enough to get these patches in without triggering Linus Torvalds's "no regression" filter.


Obstacles to contribution in embedded Linux

By Jonathan Corbet
June 9, 2015

LinuxCon Japan
Tim Bird has worked with embedded Linux for many years; during this time he has noticed an unhappy pattern: many of the companies that use and modify open-source software are not involved with the communities that develop that software. That is, he said, "a shame." In an attempt to determine what is keeping companies from contributing to the kernel in particular, the Consumer Electronics Linux Forum (a Linux Foundation workgroup) has run a survey of embedded kernel developers. The resulting picture highlights some of the forces keeping these developers from engaging with the development community and offers some ideas for improving the situation.

The problem, Tim said, is not small. A typical system-on-chip (SoC) requires 1-2 million lines of out-of-tree code to function. Keeping that code separate obviously hurts the kernel, but it is also painful for the companies involved. There is a real cost to carrying that much out-of-tree code. Sony (where Tim works), for example, was managing 1,800 patches for each phone release — and that was just kernel code.

Fixing this problem requires engagement between embedded developers and the development community. "Engagement" means more than just using the code and complying with its licensing; it means explaining requirements to the community, interacting in the community's forums, and merging changes upstream. A lot of companies still find that idea scary — they don't want to be producing code that will be used by their competitors. That is how our community works, though.

Obstacles

The idea behind the survey (conducted in September 2014) was to identify the obstacles to engagement with the community. So the group set out to locate developers who are not contributing code upstream and figure out why that isn't happening. The top reasons turned out to be:

Reason                                   % agreed
Developing against an older kernel          54%
Work depends on out-of-tree code            50%
It's too hard                               45%
Unable to test                              41%
Employer does not allow time                40%
Patch not good enough                       35%
Afraid of rejection                         33%

One thing that is not lacking is the desire to contribute: 92% of developers said that they think upstreaming code is important. But 21% of them said that management explicitly disapproves of such work, and 40% said that their management did not allow the necessary time. Issues that didn't matter included the need to use English (only 9% agreed) or developers feeling that upstreaming isn't their responsibility (6%). A bit more significant was the 26% of developers who said that their company's internal processes made contribution too hard.

At the top of the list of obstacles was "version gap" — developing code against an older kernel. Companies developing kernel code tend to start with whatever kernel was provided by the SoC vendor rather than the mainline; this is especially true for Android products. A typical Android kernel starts at Google, passes to the SoC vendor, and finally ends up at the equipment manufacturer; by that time, it is far behind the mainline.

As an example, Tim mentioned the Sony mobile kernel, which started as the 3.4 release when Google settled on it. By now, this kernel is 16 releases and three years behind the mainline. Some 26,000 commits, adding up to 1.8 million lines, have been added to Sony's kernel. The mainline kernel, of course, is in a state of constant change. Distance from the mainline makes contribution harder, since the kernel community needs patches against current kernels. As a result, it is unlikely that those 26,000 changes to a 16-release-old kernel can be easily upstreamed.

Another problem is the perceived difficulty of contributing to the kernel; to many, the process seems cumbersome and fraught with pitfalls. There are some documents on how to contribute, but those documents do not fully cover the community's social issues, timing constraints, procedures, and more. That said, Tim noted that some things are getting better; the availability of tools to find trivial issues with patches is helpful.

Kernel subsystem maintainers tend to be strict and terse with contributors, mostly as a result of overload; they simply don't have the time to fix up problematic patches. If a contributor gets a reputation for submitting bad patches and wasting maintainer time, their life will get worse. Silly mistakes can cause a patch to be quickly rejected or ignored. The problem here is that embedded developers are not full-time contributors, so they have a hard time maintaining proficiency with the process. It is also hard for them to respond to requests for changes after their project has moved on.

Adding to the problem is the fact that much embedded code simply is not at the required level of quality for upstreaming. This code is often low-quality and highly specialized; it features the sort of workarounds and quick hacks that tend to get a poor reception on the kernel mailing lists.

Dependencies on other out-of-tree code also make things worse. Tim raised the example of the out-of-tree Synaptics touchscreen driver; Sony had developed a number of fixes and enhancements for this driver, but Synaptics had no interest in taking them. So where should these patches go? It is, he noted, not at all fun to be in the business of mainlining a supplier's code.

Developers complained that management does not give them the time they need to work with upstream communities. Product development teams work on tight schedules with the object of producing a "good enough to ship" solution. This is true throughout the industry; it is not unique to open-source software. These developers have no time to respond to change requests from maintainers. The kernel community, meanwhile, is not interested in any particular company's product timelines.

Things get worse, of course, when companies wait until late in the development process to release their code — something everybody does, according to Tim. When it comes time to mainline the code, the developers discover that major changes will be needed, which is discouraging at best. It would be far better to get input from the community or an internal expert early in the process.

Overcoming the obstacles

If a company wants to overcome version-gap problems, Tim said, the best place to start is to get current mainline kernels running on the hardware it is working with. One development team can then work on the mainline, while product engineers work with whatever ancient kernel is intended for the shipping product. The two-team approach can also help with the product-treadmill problem; if a small team is dedicated to mainlining the company's code, it can operate independently of the deadlines that drive the product engineers.

Companies should employ internal mentors and train developers to work with the wider community. Tim also stressed that it is important to use the community's processes internally. When working with kernel code, keep patches in Git; using a tool like Perforce to manage patches will not ease the task of engaging with the community.

With regard to low-quality code, Tim admitted that he had no silver bullet to offer. We all have to do hacks sometimes. The best that can be done is to examine the code to determine whether it should be maintained going forward; it should be kept in Git and reviewed (and possibly improved) at forward-porting time.

To avoid the highly specialized code problem entirely, don't use specialized hardware if at all possible. Manufacturers should, Tim said, require mainline drivers from their suppliers. The cost of software development should be figured into the bill of materials when the hardware is selected. Figuring in the cost of out-of-tree drivers is "an important next step" for the industry.

Unfortunately, companies following Tim's advice are likely to run into a new issue, which he called the "proxy problem." Having a special team dedicated to mainlining code can ease interactions with the community, but it also creates a situation where the community-facing developers are not the subject-matter experts. When they try to upstream code that they did not write, they cannot quickly answer questions or test changes. There is no avoiding the need for the original developers to help the proxies in situations like this.

Why bother?

Tim closed out the session by asking: why should companies bother with upstreaming their code in the first place? He pointed out that Sony has 1,100 developers who have made patches to its kernels in the last five years; many of them are applying the same patches over and over again. Sony would like to decrease the amount of time going into that sort of activity; mainlining its changes is the obvious way to do that.

Getting code upstream has a significant financial benefit for companies: it reduces the maintenance cost of that code and allows others to share the work. Even more importantly, having code upstream can reduce the time to market for future projects. Going through the community process improves the quality of the code. It can also fend off the need to migrate over to a competing company's implementation in the future. Finally, upstreaming is a reward for developers; it is something they want to do, and it will turn them into better developers.

These are all completely selfish reasons, Tim said; they are an entirely sufficient justification for working upstream even without getting into the ethical issues.

To further this effort, the "device mainlining project" is working within the CE Linux Forum. This project will continue to analyze the obstacles to contribution and attempt to address them by promoting best practices, collecting training materials, and publishing code analysis tools. There is also work ongoing to justify community engagement to management and quantify the costs of using out-of-tree code. This group will have its next meeting at the Embedded Linux Conference Europe in October.

[Your editor would like to thank the Linux Foundation for funding his travel to LinuxCon Japan]


Patches and updates

Kernel trees

Linus Torvalds: Linux 4.1-rc7
Greg KH: Linux 4.0.5
Greg KH: Linux 3.14.44
Greg KH: Linux 3.10.80

Architecture-specific

Core kernel code

Device drivers

Device driver infrastructure

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds