Kernel development
Brief items
Kernel release status
The current development kernel is 3.16-rc1, released on June 15. Linus said:
In the end, 11,364 changesets were pulled in during the merge window (3.15-rc1 had 12,034).
Stable updates: 3.15.1, 3.14.8, 3.10.44, and 3.4.94 all came out on June 16. The list of fixes is relatively short this time, but they certainly look worth having.
Quote of the week
Anyway, I am heading out to the gym. Perhaps a few gym sessions and a couple of sleep cycles will convert the painting to useful code. This approach has worked well for me many times over the decades, so here is hoping that it does again.
Kernel development news
The 3.16 merge window concludes
On June 15, Linus Torvalds put out the 3.16-rc1 prepatch and closed the merge window for this cycle. From here on in, new features are unlikely to be added; fixes and stabilization patches should predominate.
At this point, Torvalds has merged 11,364 non-merge commits for 3.16. That's around 3,200 since last week's look (and a total of nearly 6,000 since part 1 of our merge window coverage). The last two merge windows have been two of the top three windows in terms of commits, with 3.16 in third place behind 3.10 (11,963) and 3.15 (12,034). We will have to see if the trend continues and we get 11,000–12,000 patches for 3.17 and beyond.
In any case, here are some of the more significant changes that users will see in 3.16:
- Modules now have the read-only (RO) and no-execute (NX) bits set on their data sections much earlier in the loading process, before parsing any module arguments. This will further reduce the time window in which a misbehaving (or malicious) module can modify or execute its data.
- The secure computing (seccomp) BPF filters are now just-in-time (JIT) compiled.
- Support for TCP fast open over IPv6 has been added.
- The Xen virtual network interfaces now have multi-queue support, which provides much better performance.
- Support for busy polling on stream control transmission protocol (SCTP) sockets has been added. Busy polling is set on a socket using the SO_BUSY_POLL socket option; it can reduce the latency of receives on high-traffic interfaces that support the option.
- The extended verification module (EVM) has added configuration options to support putting new extended attributes (xattrs) into the calculated HMAC value for a file. Using that facility, three Smack attributes (SMACK64EXEC, SMACK64TRANSMUTE and SMACK64MMAP) can now be added into the HMAC calculation.
- Btrfs has added a new ioctl() called BTRFS_IOC_TREE_SEARCH_V2 to search the filesystem for keys. As its name would imply, it is a more flexible version of the existing BTRFS_IOC_TREE_SEARCH; it allows a larger buffer to be passed in to retrieve search results that won't fit into the older ioctl()'s 3992-byte fixed-size buffer.
- New hardware support:
- Graphics: The Nouveau driver now supports NVIDIA Tesla K40 GK110B devices and has initial support for NVIDIA Tegra K1 GK20A devices. Support has also been added for ASPEED AST2400 devices.
- Miscellaneous: Freescale Quad Special Peripheral Interface (SPI) controllers; LPDDR2-NVM flash chips; Broadcom Kona pulse-width modulation (PWM) blocks; Intel system-on-chip (SoC) platform digital temperature sensors (DTSs); and Sensiron SHTC1 and SHTW1 humidity and temperature sensors.
- Network devices: Broadcom BCM7xxx set-top box SYSTEMPORT Ethernet MACs; STMicroelectronics ST21NFCA near field communication (NFC) controllers; Renesas R-Car SoC controller area network (CAN) controllers; Geschwister Schneider USB/CAN devices; Xilinx CAN devices; Hisilicon HIX5HD2 family network devices; and AMD SoC 10GbE Ethernet devices.
- Staging graduation: Freescale i.MX5/6 v3 image processing units (IPUv3).
Changes of interest to kernel developers include:
- A simple interval-tree interface has been added as lib/interval_tree.c. The interval tree is implemented as an augmented red-black tree.
- Tracepoints have been added to give finer resolution of events during suspend and resume.
- The BPF interpreter now has a self-test that covers both classic and internal BPF instructions.
- A software TCP segmentation offload (TSO) API has been added; several drivers have used it to add software TSO support (mvneta, mv643xx_eth, fec).
- The Documentation/mutex-design.txt file has been extensively updated to better reflect today's reality.
- Optimistic spinning has been added to read-write semaphores (rwsems). Also, a queued variant of read-write locks (qrwlocks) has been added.
- Two new methods, read_iter() and write_iter(), have been added to struct file_operations. They are intended to support the move toward using the iov_iter interface and are meant to eventually replace the aio_read() and aio_write() methods.
Now the stabilization phase for 3.16 begins. That means we are likely to see a final 3.16 release sometime in early August depending on how the cycle goes. Then it will be time to start the festivities all over again for 3.17.
The volatile volatile ranges patch set
"Volatile ranges" is a name given to regions of user-space memory that can be reclaimed by the kernel when memory is tight. The classic use case is for a web browser's image cache; the browser would like to keep that information in memory to speed future page loads, but it can do without that data should the memory used for the cache be needed elsewhere. Implementations of the volatile range concept have experienced more than the usual amount of change; that rate of change may well continue into the future — if a developer can be found to continue the work.
Early versions of the patch set were based on the posix_fadvise() system call. Some developers complained that it was more of an allocation-related concept, so the patch was reworked to use fallocate() instead. By 2013, the plan had shifted toward the addition of two new system calls named fvrange() and mvrange(). Version 11, released in March 2014, moved to a single system call named vrange(). During all of these iterations, there have also been concerns about user-space semantics (what happens when a process tries to access a page that has been purged, in particular) and the best way to implement volatile ranges internally. So nothing has ever been merged into the mainline kernel.
Version 14, posted by John Stultz on April 29, changes the user-space API yet again. Volatile ranges have now shifted to the madvise() system call. In particular, a call to:
madvise(address, length, MADV_VOLATILE);
will mark the memory range of length bytes starting at address as being volatile. Once the memory range has been marked in this way, the kernel is free to reclaim the associated pages and discard their contents at any time. Should the application need access to the range in the future, it should mark it as being nonvolatile with:
madvise(address, length, MADV_NONVOLATILE);
The return value is zero for success (the range is now nonvolatile and the previous contents remain intact), a negative number if some sort of error occurred, or one if the operation was successful but at least one of the pages has been purged.
The use of madvise() had been considered in the past; it makes sense, given that the purpose is to advise the kernel about the importance of a particular range of memory. Previous volatile range implementations, though, had the property that marking a range nonvolatile could fail partway through. That meant that the interface had to be able to return two values: (1) how many pages had been successfully marked, and (2) whether any of them had been purged. This time around, John found a way to make the operation atomic, in that it either succeeds or fails as a whole. In the absence of a need for a second return value, the madvise() interface is adequate for the task.
What happens if user space attempts to access a volatile page that has been purged by the kernel? This implementation will deliver a SIGBUS signal in that situation. A properly-equipped application can catch the signal and respond by obtaining the needed data from some other source; applications that are not prepared will litter the disk with unsightly core dumps instead. That may seem like an unfriendly response, but one can argue that an application should not be trying to directly access memory that, according to instructions it gave to the kernel, does not actually need to be kept around.
Minchan Kim does not like this approach; he would prefer, instead, that the application simply receive a new, zero-filled page in this situation. He is, it turns out, thinking about a slightly different use case: code that reuses memory and wants to tell the kernel that the old contents need not be preserved. In this case, the reuse should be as low-overhead as possible; Minchan would prefer to have no need for either an MADV_NONVOLATILE call or a SIGBUS signal handler. John suggested that Minchan's own MADV_FREE patch was better suited to that use case, but Minchan disagreed, noting that MADV_FREE is a one-time operation, while MADV_VOLATILE can "stick" to a range of memory through several purge-and-reuse cycles. John, however, worries that silently substituting zero-filled pages could lead to data corruption or other unpleasant surprises.
Johannes Weiner, who joined the conversation in June, also prefers that purged pages be replaced by zero-filled pages on access. He asked if the patch set could be reworked on top of MADV_FREE (which, he thinks, has a better implementation internally) to provide a choice: applications could request either the new-zero-filled-page or the SIGBUS semantics. John responded that he might give it a try, someday.
John certainly cannot be faulted for a lack of effort; this patch set has been through fourteen revisions since 2011; it has also been the subject of sessions at the Kernel Summit and Linux Storage, Filesystem, and Memory Management Summit. It has seen extensive revisions in response to comments from several reviewers. But, somehow, this feature, which has real users waiting for it to show up in a mainline kernel, does not seem much closer to being merged than before.
At the same time, it is hard to fault the reviewers. The volatile ranges concept adds new user-visible memory-management behavior with some subtle aspects. If the implementation and interface are not right, the pain will be felt by developers in both kernel and user space for a long time. Memory-management changes are notoriously hard to get into the kernel for a good reason; user-visible changes are even worse. This patch set crosses two areas where, past history shows, we have a hard time getting things right, so some caution is warranted.
Still, one can't help but wonder if merging nothing at all yields the best kernel in the long run. Users will end up working with out-of-tree variants of this concept (Android's "ashmem" in particular) that the development community has even less control over. Unless somebody comes up with the time to continue trying to push this patch set forward, the mainline kernel may never acquire this feature, leaving users without a capability that they demonstrably have a need for.
Teaching the scheduler about power management
Power-efficient CPU scheduling is increasingly important in the mobile world, but it has become just as important in large data-center settings, where the electricity bills can be painful indeed. Unfortunately, the kernel's infrastructure for CPU power management lacks integration with the scheduler itself, with the result that scheduling decisions are not as good as they should be. This article reviews the state of the CPU power-management mechanisms in the kernel and looks at what is being done to improve the situation.
A bit of history
A process scheduler is the core component of an operating system responsible for selecting which process to run next. The scheduler implementation in the Linux kernel has been through a couple of iterations — and even complete rewrites — over the years. The Completely Fair Scheduler (CFS), written by Ingo Molnar, was introduced in the 2.6.23 kernel. It replaced the O(1) scheduler which, in turn, was introduced in version 2.5.2 of the kernel, also by Ingo, replacing the scheduler implementation that existed before that. Despite all of the different algorithms, the general goal is always the same: to try to make the most of available CPU resources.
CPU resources have also evolved during this time. Initially, the scheduler's role was simply to properly manage processor time between all runnable processes. Increasing parallelism in hardware due to the emergence of SMP, SMT (or Hyper-threading), and NUMA added more twists to the problem. And, of course, the scheduler had to scale to an ever-increasing number of processes and processors in the same system without consuming too much CPU time on its own. These changes explain why multiple scheduler implementations have been developed over the years and are still being worked on today. In the process, the scheduler has grown in complexity and only a few individuals have become experts in this area.
Initially, task scheduling was all about throughput with no regard for energy consumption; scheduler work was driven by the enterprise space, where everything was plugged into the wall. At the other end of the spectrum, we saw the emergence of battery-operated devices from the embedded and mobile space, where power management is a primary concern. Separate subsystems dealing with power management, such as cpuidle and cpufreq, were introduced and contributed to by a different set of developers with little scheduler experience. In due course, the power-management subsystems grew in complexity as well, each with its own experts.
This split arrangement worked out reasonably well… at least initially. The isolation between the subsystems allowed for easier development and maintenance. With mobile devices growing in capabilities, as well as ever-increasing data-center electric bills, everyone started caring about energy efficiency. This brought about core kernel changes such as deferrable timers, dyntick, and runtime power management. The rise of multi-core portable devices pushed the need for yet more aggressive power-management tricks such as the controversial use of CPU offlining.
There is a pattern that emerges from these changes: the more complex the scheduler and power management become, the more isolated they are from each other. And this turns out to be completely counterproductive since, as we'll see later, one side can't predict what the other side might do in the near future. Because of (or in spite of) that, some chip manufacturers are increasingly implementing dynamic voltage and frequency scaling (DVFS) in hardware, away from the operating system, which exacerbates the problem. Yet support for ARM's big.LITTLE and the increasing influence scheduler decisions have on power consumption in general have made it clear that merging power management with the scheduler is becoming unavoidable.
Scheduler: meet cpuidle
The cpuidle subsystem tries to minimize power consumption by selecting a low-power mode, or idle mode (often referred to as a C-state), when the CPU is idle. However, idling a CPU comes at a price: the more power savings such a mode provides, the longer it will take for the affected CPU to become fully operational again. A good balance between the power actually saved and the time "wasted" in entering and exiting a power-saving mode has to be reached. Furthermore, many modes consume some non-negligible amount of power for the CPU simply to transition in and out of them, meaning the CPU has to be idle for a sufficiently long period of time for those modes to be worth entering. Most CPUs have multiple idle modes, providing different trade-offs between achievable power savings and latency.
Therefore, the cpuidle code has to gather statistics on actual CPU usage to select the most appropriate idle mode depending on the observed idleness pattern of the CPU. And this statistics-gathering work duplicates what the scheduler already does, albeit through indirect and somewhat imprecise heuristics.
Idleness patterns are determined by wake-up events bringing the CPU out of idle. Those events can be classified into three categories:
- Predictable events: This group comprises all timers from which we can obtain their next expiry time and deduce a minimum idle period.
- Semi-predictable events: These are somewhat repetitive events, like I/O request completions, that often follow a known pattern.
- Random events: Anything else, such as keystrokes, touchscreen events, network packets, etc.
By directly involving the scheduler in the idle-state selection process, we can do a much better job at considering the semi-predictable events. I/O patterns are mainly a function of those tasks generating them and the device they're directed to. The scheduler can therefore keep track of the average I/O latency on a per-task basis and, possibly, with some input from the I/O scheduler, provide an estimated delay for the next I/O completion to occur according to the list of waiting tasks on a given CPU. And if a task is migrated to a different CPU, its I/O latency statistics are migrated along. The scheduler is therefore in a better position to appreciate the actual idleness of a CPU.
It is therefore necessary for the scheduler and cpuidle to become better integrated, to let the scheduler manage the available idle modes and eventually bypass the current cpuidle governors entirely. Moving the main idle loop into the scheduler code will also allow for better accounting of CPU time spent in the servicing of interrupts and their occurrence rate while idle.
Furthermore, the scheduler should be aware of the current idle mode on each CPU to do a better job at load balancing. For instance, let's consider the function find_idlest_cpu() in kernel/sched/fair.c, which looks for the least-loaded CPU by comparing the weighted CPU load value for each CPU. If multiple CPUs are completely idle, their load would be zero, with no distinction for the idle mode they're in. In this case, it would be highly beneficial to choose the CPU whose current idle mode has the shortest exit latency. If idle exit latency is the same for all idle CPU candidates then the last to have entered idle mode is more likely to have a warmer cache (assuming the relevant idle mode preserves cache, of course). An initial patch series to that effect was posted by Daniel Lezcano.
This also highlighted the fact that some definitions for the same expression may differ depending on one's perspective. A function called find_idlest_cpu() in the scheduler context is simply the converse of find_busiest_cpu(), whereas in the cpuidle context this would mean looking for the CPU with the deepest idle state. The deeper an idle state is, the more costly it is to bring a CPU back to operational state — clearly not what we want here. A similar confusion may occur with the word "power". The traditional meaning in the scheduler code is "compute capacity" while it means "energy consumption rate" in a power management context. Patches to clarify this have recently been merged.
Scheduler: meet cpufreq
The scheduler keeps track of the average amount of work being done by scheduled tasks on each CPU in order to give each task fair access to CPU resources and to decide when to perform load balancing. The ondemand cpufreq governor does similar load tracking in order to dynamically set each CPU's clock frequency to optimize battery life. Since dynamic power consumption is proportional to the square of the supply voltage, it is desirable to run at the lowest clock frequency — which allows for a voltage reduction to the CPU — that is still fast enough to perform all the scheduled work during a given period of time.
As with cpuidle, the cpufreq subsystem was developed in isolation from the CPU scheduler. Many problems result from the split between those subsystems:
- The cpufreq code goes to great lengths trying to evaluate the actual CPU load through indirect means, including heuristics to avoid misprediction, while, once again, the scheduler has all this information available already.
- The scheduler can determine the load contribution of individual tasks, whereas the cpufreq code has no such ability. In the event of a task migration, or a task waking up, the scheduler may determine in advance what the load on the target CPU is likely to become. The cpufreq code may only notice an average load increase and react to it after the fact.
- The scheduler records the execution time for each task in order to ensure fairness between all tasks. However, since the scheduler has no awareness of CPU frequency changes, tasks executing on a CPU whose clock has been slowed down will be unfairly charged more execution time than similar tasks running on another CPU with a faster clock for the same amount of work. Fairness is thus compromised.
- As the CPU clock frequency is reduced, the resulting apparent increase in task load may trigger load balancing toward a less-loaded CPU in order to spread the load, despite the fact that this increased apparent load was cpufreq's goal in the first place.
- And if that load balancing happens while the target CPU's clock frequency is reduced, then that CPU could end up being oversubscribed. Because there is no coordination between the scheduler and cpufreq, either (or both) of them may react by, respectively, migrating a task back or raising the CPU clock frequency. The CPU may suddenly be underutilized, and the cycle could repeat again.
To fix this, the current plan is to integrate cpufreq more tightly with the scheduler. The various platform-specific, low-level cpufreq drivers will remain unchanged and still register with the cpufreq core as usual; however, the governors — the part that decides what clock frequency to ask for and when — could be bypassed entirely. In fact, the scheduler could simply register itself as a new governor with the cpufreq core.
The advantage of a tighter integration of cpufreq with the scheduler is the ability to be proactive with clock frequency changes rather than reactive, and also to coordinate better with scheduler activities like load balancing. A CPU clock frequency change could be requested in anticipation of a load change; this could happen in response to a call to fork(), exit(), or when a task sleeps or wakes up. The frequency policy could be different depending on the particular scheduler event, task historical behavior patterns, etc.
However, to be able to perform well in the presence of varying CPU clock frequencies, the notion of scale-invariant task load tracking must be added to the scheduler. This is a correction factor to normalize load measurements from CPUs executing code at different speeds so the load contribution of a task can be predicted when the task is moved. The relative computation capacity of each CPU as seen by the scheduler also has to be adjusted according to its effective clock frequency in order to do proper load balancing. It is still unclear how accurate this correction factor can be, considering that tasks making lots of memory accesses are less affected by the CPU clock speed than tasks performing pure computation. Still, any correction is going to be better than none at all, which is what we have today.
Incidentally, scale-invariant load tracking applies to big.LITTLE scheduling as well. Leaving cpufreq aside for a moment, a "little" CPU is permanently slower, which translates into a reduced compute capacity as seen by the scheduler; conversely, a "big" CPU has more capacity. With distinct correction factors permanently applied to "big" and "little" CPUs, the scheduler is likely to just work optimally in terms of task throughput, with no further changes. The cpufreq correction factor simply has to be combined with the "big" and "little" factors afterward.
Several developers, including Mike Turquette and Tuukka Tikkanen, are working on the cpufreq integration and initial patches should be posted for public review soon.
Scheduler: may the power be with you
Okay… So we might get to the point where cpuidle and cpufreq are tightly integrated with the scheduler. Are we done? Unlikely. In fact we now have more difficult decisions to make than before and they all relate to the new mechanisms at the scheduler's disposal to perform load balancing. For example:
- When the system load goes up, should a new CPU be brought out of idle, or should the clock frequency on an already running CPU be increased instead? Or both?
- Conversely, when the system load goes down, is it best to keep more CPUs alive with a reduced clock frequency or pack more tasks on fewer CPUs in order to send the other CPUs to sleep?
- Is it best to consolidate loads onto fewer CPUs or to spread the load over more CPUs?
- When is it time to perform active task packing to let a whole CPU cluster (or package) get into low-power mode?
The latest power-aware scheduler work from Morten Rasmussen provides a framework to evaluate the power cost of the available scheduling scenarios. This, in combination with Vincent Guittot's sched_domain topology and CPU capacity tracking consolidation work, should provide answers to the above questions.
What else?
We desperately need measurement tools to validate proposed solutions. Linaro is working on a tool called idlestat to validate idle-state usage and its effectiveness. Traditional benchmark tools such as sysbench may be combined with energy usage monitoring to provide a way to perform power characterization of a system. Extensions to cyclictest to create various synthetic workloads are being explored as well. This is still unwieldy, though, and more integration and automation are required.
This article hasn't covered thermal management. The Linux kernel implements a thermal-management interface that allows a user-space daemon to control thermal constraints. However, as we've seen, power-related issues are intertwined, and a thermal-control solution that lives separately from the scheduler is likely to be suboptimal. If the scheduler controls power states, it will also have to deal with platform temperature someday, providing thermal "provisioning" or the like. But we can save this for another day.
Thanks
Thanks to Amit Kucheria, Daniel Lezcano, Kevin Hilman, Mike Turquette, and Vincent Guittot for their help in reviewing this article.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet