Kernel development
Brief items
Kernel release status
The current development kernel is 3.13-rc1, released on November 22. In the end, 10,518 non-merge changesets were pulled into the mainline during this merge window. Now the stabilization period starts, with the final 3.13 release due around the end of the year.

Stable updates: no stable updates have been released in the last week. 3.12.2, 3.11.10, 3.10.21, and 3.4.71 are in the review process as of this writing; they can be expected sometime on or after November 28. Note that 3.11.10 is expected to be the final update to the 3.11 kernel.
Quotes of the week
Unfortunately it is impossible to know at this point what work is actually relevant for SPECTRE and what is not, so we can't really merge anything specific to ARM64+ACPI until we have access to an actual spec, or we get a video message by someone with a monocle and a lap cat to shed some more light on the actual requirements.
Checkpoint/restore tool v1.0
After years of work, version 1.0 of the checkpoint/restore tool is available. This is a mostly user-space-based tool that is able to capture the state of a set of processes to persistent storage and restore it at some future time, possibly on a different system. See this 2013 Kernel Summit article for details on the current state of this functionality.

Facebook likes Btrfs
Two Btrfs developers — Chris Mason and Josef Bacik — have simultaneously announced their departure from Fusion IO to work for Facebook instead. Chris says: "From a Btrfs point of view, very little will change. All of my Btrfs contributions will remain open and I'll continue to do all of my development upstream." Josef adds: "Facebook is committed to the success of Btrfs so not much will change as far as my involvement with the project, I will still be maintaining btrfs-next and working on stability."
Kernel development news
The conclusion of the 3.13 merge window
Linus released 3.13-rc1 and closed the 3.13 merge window on November 22, perhaps a couple of days earlier than some developers might have expected. Counting a couple of post-rc1 straggler pulls, some 10,600 non-merge changesets were pulled into the mainline for this development cycle; that is roughly 700 more than at the time of last week's summary.

As might be expected, the list of user-visible features included in that relatively small set of patches is short; it includes:
- The squashfs filesystem now has multi-threaded decompression; it can
also decompress directly into the page cache, eliminating the
temporary buffer used previously.
- There have been several changes to the kernel's key-storage subsystem.
The maximum number of keys has increased to an essentially unlimited
value, allowing, for example, the NFS code to store vast numbers of
user ID mapping values as keys. There is a new concept of a "trusted"
key, being one obtained from the hardware or otherwise validated, and
keyrings can be marked as allowing only trusted keys. Finally, a
mechanism for persistent keys not attached to a given user ID has been
added, and key data can be quite large; both of these changes were
needed to enable Kerberos to use the key subsystem. (A brief user-space example of the key-management interface appears after this list.)
- New hardware support includes:
  - Input: Samsung SUR40 touchscreens.
  - Security: Nuvoton NPCT501 I2C trusted platform modules, Atmel AT97SC3204T I2C trusted platform modules, OMAP34xx random number generators, Qualcomm MSM random number generators, and Freescale cryptographic accelerators (job ring support).
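Returning to the key-subsystem work mentioned above: those changes build on the long-standing user-space key-management interface. As a minimal illustration of that interface (a sketch only; the "user" key type, description, and payload below are arbitrary example values), a program can add a key to its session keyring and read it back with the keyutils library:

    /* Minimal sketch of the key-management interface via libkeyutils.
     * Build with: gcc keydemo.c -o keydemo -lkeyutils
     * The description and payload are arbitrary example values. */
    #include <keyutils.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            const char payload[] = "example-secret";
            key_serial_t key;
            char buf[64];
            long len;

            /* Add a "user" type key to the session keyring */
            key = add_key("user", "example:demo", payload, strlen(payload),
                          KEY_SPEC_SESSION_KEYRING);
            if (key == -1) {
                    perror("add_key");
                    return EXIT_FAILURE;
            }

            /* Read the payload back out of the kernel */
            len = keyctl_read(key, buf, sizeof(buf));
            if (len < 0) {
                    perror("keyctl_read");
                    return EXIT_FAILURE;
            }
            printf("key %d holds %ld bytes\n", key, len);
            return EXIT_SUCCESS;
    }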
Changes visible to kernel developers include:
- There is a new associative array data structure in the kernel. It was
added to support the keyring work, but could be applicable in other
situations as well. See Documentation/assoc_array.txt for
details.
- The information in struct page is now even more dense with the addition of Joonsoo Kim's patch set to have the slab allocator store more information there. See this article for details.
Now the final stabilization phase for all of this work begins. Your editor predicts that the final 3.13 kernel will be released sometime between the New Year and the beginning of linux.conf.au 2014 on January 6.
The tick broadcast framework
Power management is an increasingly important responsibility of almost every subsystem in the Linux kernel. One of the most established power-management mechanisms in the kernel is the cpuidle framework, which puts idle CPUs into sleeping states until they have work to do. These sleeping states are called the "C-states" or CPU operating states; the deeper a C-state, the more power is conserved.

However, an interesting problem surfaces when CPUs enter certain deep C-states. Idle CPUs are typically woken up by their respective local timers when there is work to be done, but what happens if these CPUs enter deep C-states in which those timers stop working? Who will wake up the CPUs in time to handle the work scheduled on them? This is where the "tick broadcast framework" steps in. It assigns a clock device that is not affected by the C-states of the CPUs as the timer responsible for handling the wakeup of all those CPUs that enter deep C-states.
Overview of the tick broadcast framework
In the case of an idle or a semi-idle system, there could be more than one CPU entering a deep idle state where the local timer stops. These CPUs may have different wakeup times. How is it possible to keep track of when to wake up the CPUs, considering a timer is merely a clock device that cannot keep track of more information than the time at which it is supposed to interrupt? The tick broadcast framework in the kernel provides the necessary infrastructure to handle the wakeup of such CPUs at the right time.
Before looking into the tick broadcast framework, it is important to understand how the CPUs themselves keep track locally of when their respective pending events need to be run.
The kernel keeps track of the time at which a deferred task needs to be run based on the concept of timeouts. The timeouts are implemented using clock devices called timers which have the capacity to raise an interrupt at a specified time. In the kernel, such devices are called the "clock event" devices. Each CPU is equipped with a local clock event device that is programmed to interrupt at the time of the next-to-run deferred task on that CPU, so that said task can be scheduled on the CPU. These local clock event devices can also be programmed to fire periodically to do regular housekeeping jobs like updating the jiffies value, checking if a task has to be scheduled out, etc. These timers are therefore called the "tick devices" in the kernel and are represented by struct tick_device.
A per-CPU tick_device representing the local timer is declared using the variable tick_cpu_device. Every CPU keeps track of when its local timer needs to interrupt it next in its copy of tick_cpu_device as next_event and programs the local timer with this value. To be more precise, the value can be found in tick_cpu_device->evtdev->next_event, where evtdev is an instance of the clock event device mentioned above.
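For reference, the pieces described above look roughly like this (an abridged and lightly paraphrased view of the 3.13-era definitions in include/linux/clockchips.h, include/linux/tick.h, and kernel/time/tick-common.c; many fields are omitted):

    /* Abridged view of the structures involved; many fields omitted. */
    struct clock_event_device {
            void            (*event_handler)(struct clock_event_device *);
            int             (*set_next_event)(unsigned long evt,
                                              struct clock_event_device *);
            void            (*broadcast)(const struct cpumask *mask);
            ktime_t         next_event;     /* next programmed expiry */
            unsigned int    features;       /* e.g. CLOCK_EVT_FEAT_C3STOP */
            /* ... */
    };

    struct tick_device {
            struct clock_event_device       *evtdev;
            enum tick_device_mode           mode;
    };

    /* Each CPU has a tick device wrapping its local timer: */
    DEFINE_PER_CPU(struct tick_device, tick_cpu_device);

    /* So the next local-timer expiry for a given CPU is found in
     * per_cpu(tick_cpu_device, cpu).evtdev->next_event. */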
The external clock device that is required to stand in for the local timers in some deep idle states is just another tick device, but is not normally required to keep track of events for specific CPUs. This device is represented by tick_broadcast_device (defined in kernel/time/tick-broadcast.c), in contrast to tick_cpu_device.
Registering a timer as the tick_broadcast_device
During the initialization of the kernel, every timer in the system registers itself as a tick_device. These timers carry flags that describe their properties; the one of special interest here is CLOCK_EVT_FEAT_C3STOP, which indicates that the timer stops in the C3 idle state. Although the C3 idle state is specific to the x86 architecture, this flag is used generically to convey that the timer stops in one of the deep idle states.
Any timers which do not have the flag CLOCK_EVT_FEAT_C3STOP set are potential candidates for tick_broadcast_device. Since all local timers have this flag set on architectures where they stop in deep idle states, all of them become ineligible for this role. On architectures like x86, there is an external device called the HPET — High Precision Event Timer — which becomes a suitable candidate. Since the HPET is placed external to the processor, the idle power management for a CPU does not affect it. Naturally it does not have the CLOCK_EVT_FEAT_C3STOP flag set among its properties and becomes the choice for tick_broadcast_device.
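In code form, the eligibility test boils down to a check of that flag. The following is a simplified sketch, not the actual kernel function (which lives in kernel/time/tick-broadcast.c and checks several other conditions as well):

    /* Simplified sketch: a clock event device can serve as the broadcast
     * device only if it keeps running in deep idle states, i.e. it does
     * not carry the CLOCK_EVT_FEAT_C3STOP flag. */
    static bool can_be_broadcast_device(struct clock_event_device *dev)
    {
            if (dev->features & CLOCK_EVT_FEAT_C3STOP)
                    return false;   /* stops in deep C-states: ineligible */
            return true;            /* e.g. the HPET on x86 */
    }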
Tracking the CPUs in deep idle states
Now we'll return to the way the tick broadcast framework keeps track of when to wake up the CPUs that enter idle states when their local timers stop. Just before a CPU enters such an idle state, it calls into the tick broadcast framework. This CPU is then added to a list of CPUs to be woken up; specifically, a bit is set for this CPU in a "broadcast mask".
Then a check is made to see whether this CPU needs to be woken up before the time for which the tick_broadcast_device is currently programmed. If so, the time at which the tick_broadcast_device should interrupt is updated to reflect the new value, and this value is programmed into the external timer. The tick_cpu_device of the CPU that is entering the deep idle state is now put into CLOCK_EVT_MODE_SHUTDOWN mode, meaning that it is no longer functional.
Each time a CPU goes into a deep idle state, the above steps are repeated and the tick_broadcast_device is programmed to fire at the earliest of the wakeup times of the CPUs in deep idle states.
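In the 3.13 kernel, an idle driver asks for this service through clockevents_notify(). The following sketch shows, in simplified form, how a deep-idle entry path can be bracketed by the broadcast enter/exit notifications (modeled loosely on what drivers such as intel_idle do; go_to_sleep() is a stand-in for the architecture-specific code that actually enters the C-state, and error handling is omitted):

    /* Simplified sketch of handing a CPU over to the tick broadcast
     * framework around a timer-stopping C-state. */
    static void enter_deep_idle(int cpu)
    {
            /* The local timer will stop: let the broadcast device cover
             * this CPU.  This sets the CPU's bit in the broadcast mask
             * and, if needed, reprograms the broadcast device. */
            clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);

            go_to_sleep();  /* stand-in for the real C-state entry */

            /* Back from the broadcast IPI (or another wakeup): clear the
             * CPU's bit in the mask and revive the local tick device. */
            clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
    }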
Waking up the CPUs in deep idle states
When the external timer expires, it interrupts one of the online CPUs, which scans the list of CPUs that have asked to be woken up to check if any of their wakeup times have been reached. That means the current time is compared to the tick_cpu_device->evtdev->next_event of each CPU. All those CPUs for which this is true are added to a temporary mask (different from the broadcast mask), and the tick_broadcast_device's next expiry time is set to the earliest of the wakeup times of the CPUs that remain in deep idle states. What remains to be seen is how the CPUs in the temporary mask are woken up.
Every tick device has a "broadcast method" associated with it. This method is an architecture-specific function encapsulating the way inter-processor interrupts (IPIs) are sent to a group of CPUs. Similarly, each local timer is also associated with this method. The broadcast method of the local timer of one of the CPUs in the temporary mask is invoked by passing it the same mask. IPIs are then sent to all the CPUs that are present in this mask. Since wakeup interrupts are sent to a group of CPUs, this framework is called the "broadcast" framework. The broadcast is done in tick_do_broadcast() in kernel/time/tick-broadcast.c.
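A simplified sketch of that expiry path follows; it is not the actual kernel code, and broadcast_mask and program_broadcast_device() are stand-ins for the real mask and reprogramming logic in kernel/time/tick-broadcast.c:

    /* Simplified sketch of the broadcast-expiry path. */
    static void broadcast_timer_expired(ktime_t now, struct cpumask *broadcast_mask)
    {
            struct cpumask wakeup_mask;              /* the temporary mask */
            ktime_t next = { .tv64 = KTIME_MAX };    /* earliest remaining wakeup */
            int cpu, this_cpu = smp_processor_id();

            cpumask_clear(&wakeup_mask);
            for_each_cpu(cpu, broadcast_mask) {
                    struct clock_event_device *evt =
                            per_cpu(tick_cpu_device, cpu).evtdev;

                    if (evt->next_event.tv64 <= now.tv64)
                            cpumask_set_cpu(cpu, &wakeup_mask);  /* due: wake it */
                    else if (evt->next_event.tv64 < next.tv64)
                            next = evt->next_event;
            }

            /* One IPI wakes every CPU whose time has come; the handler on
             * those CPUs behaves like a local timer interrupt. */
            if (!cpumask_empty(&wakeup_mask))
                    per_cpu(tick_cpu_device, this_cpu).evtdev->broadcast(&wakeup_mask);

            /* Re-arm the broadcast device for the CPUs still asleep. */
            if (next.tv64 != KTIME_MAX)
                    program_broadcast_device(next);
    }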
The IPI handler for this specific interrupt needs to be that of the local timer interrupt itself, so that the CPUs in deep idle states wake up as if they had been interrupted by their own local timers. The effect of their local timers stopping on entry to the idle state is hidden from them; they should see the same state before and after wakeup and continue running as if nothing had happened.
While handling the IPI, the CPUs call into the tick broadcast framework so that they can be removed from the broadcast mask, since it is known that they have received the IPI and have woken up. Their respective tick devices are brought out of the CLOCK_EVT_MODE_SHUTDOWN mode, indicating that they are back to being functional.
Conclusion
As can be seen from the above discussion, enabling deep idle states causes the kernel to do additional work. One might therefore naturally wonder whether it is worth the trouble, since it could hamper performance in the process of saving power.
Idle CPUs enter deep C-states only if they are predicted to remain idle for a long time, on the order of milliseconds. Broadcast IPIs should therefore be well spaced in time and not frequent enough to affect the performance of the system. The tick broadcast framework could be optimized further by aligning the wakeup time of the idle CPUs to a periodic tick boundary whose interval is on the order of a few milliseconds, so that CPUs going idle at almost the same time choose the same wakeup time. By looking at more such ways to minimize the number of broadcast IPIs sent, we could ensure that the overhead involved is insignificant compared to the large power savings that the deep idle states yield. If this can be achieved, it is a good enough reason to enable and optimize an infrastructure for the use of deep idle states.
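As a concrete illustration of that alignment idea (hypothetical code, not something present in the kernel), a helper could round each requested wakeup up to the next boundary of a common period, so that CPUs going idle at nearly the same time share one broadcast interrupt; the cost is that a wakeup may fire slightly later than requested, which deep-idle wakeups can usually tolerate:

    /* Hypothetical illustration: round a requested wakeup time up to the
     * next boundary of a common period, so that CPUs with nearby
     * deadlines share a single broadcast interrupt. */
    #define WAKEUP_ALIGN_NS (4 * NSEC_PER_MSEC)     /* example: 4 ms grid */

    static ktime_t align_wakeup(ktime_t expires)
    {
            u64 ns = ktime_to_ns(expires);

            ns = roundup(ns, WAKEUP_ALIGN_NS);      /* next boundary */
            return ns_to_ktime(ns);
    }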
Acknowledgments
I would like to thank my technical mentor Vaidyanathan Srinivasan for having patiently reviewed the initial drafts, my manager Tarundeep Singh, and my teammates Srivatsa S. Bhat and Deepthi Dharwar for their guidance and encouragement during the drafting of this article.
Many thanks to IBM Linux Technology Center and LWN for having provided this opportunity.
ACPI for ARM?
The "Advanced Configuration and Power Interface" (ACPI) was not an obvious win when support for it was first merged into the mainline kernel. The standard was new, actual implementations were unreliable, and supporting it involved bringing a large virtual machine into the kernel. For years, booting with ACPI disabled was the first response to a wide range of problems; one can still find web sites advising readers to do that. But, for the most part, ACPI has settled in as a mandatory part of the PC platform standard. Now, however, it appears that a similar story may be about to play out in the ARM world.
Arguments for and against ACPI
There have been rumblings for a few years that ACPI would start to appear in ARM-based systems, and in server systems in particular. Recently, some code to support such systems has started to make the rounds; Olof Johansson, a co-maintainer of the arm-soc tree, looked at this code and didn't like what he saw:
In this message and several followups Olof clarified what he was trying to get across. The ARM world already has a mechanism to describe the hardware — device trees — that is only now coming into focus. Adding device tree support has required making changes to a large amount of platform and driver code; supporting ACPI threatens to bring just as much work and add a second code path for system configuration that will need to be maintained forever. Even worse is the fact that there are no established standards for ACPI in the ARM setting; nobody really knows how things are supposed to work, and what is coming out in the early stages is not encouraging. Bringing in ARM ACPI support now would be committing the kernel community to supporting a moving target indefinitely.
Olof went on to suggest that it might be best to wait for others to figure out how ACPI on ARM is supposed to work:
He added that, until there are ACPI systems shipping with Windows and working well, the Linux community should stay far away from ACPI on ARM. If ACPI-based systems actually hit the market, he said, they can be supported with a pre-boot layer that translates the system's ACPI tables into the device tree format.
Disagreement with this position came in a couple of forms. Several people pointed out that standards developed by Microsoft may not suit the Linux community as well as we might like. As Mark Rutland (a device tree bindings maintainer) put it:
Russell King added another point echoed by many: refusing to support ACPI could cost the community its chance to influence (or even control) how the standard evolves. In his words:
Shutting the door on ACPI, Russell asserted, would be a move that the community would regret in the long term.
Jon Masters joined the conversation to make
the claim that
ARM-based servers were committed to the ACPI path, saying "all of the
big boys are going to be using ACPI whether it's liked much or
not
". He said that the server space requires a mechanism that has
been standardized and set in stone, and that, in his opinion, the device
tree abstraction is far too unstable to be usable (a claim that Grant Likely strongly disagreed with). Red Hat, Jon said, is fully behind ACPI on ARM servers for
all of the products that it has absolutely not said it will ever offer.
Jon's wording, along with his suggestion that everything has already been
decided in NDA-protected conference rooms, won him few friends in this
discussion, but his point
remains: there will be systems using ACPI on the market, and Linux has to
deal with them somehow.
What to do
But that still doesn't answer the question of how to deal with them. Arnd Bergmann suggested that ACPI might not be a long-term issue for the ARM community:
Most people, though, seemed to think that ACPI could be here to stay, so the community will have to figure out some way of dealing with it.
One possibility might be Olof's idea of translating the ACPI tables into a device tree, but that approach was somewhat unpopular. To many, it looks like a partial solution that would run into no end of difficulties; there is also the matter of running the ACPI Machine Language (AML) code found in the ACPI firmware. AML can be necessary for hardware initialization and power-management tasks, but it has no analog in the device tree world. Generally, there was a sentiment that, if ACPI is to be supported on ARM systems, it should be supported properly and not hidden behind some sort of translation layer.
In the short term, some sort of translation to device trees — either at
boot-time or done by hand — seems likely to
be the outcome, though. Putting code into the kernel to support any
ACPI-based systems that might appear in the near future just seems to many
like a way to take on a long-term support burden for short-lived systems.
What might start to tip the balance could be systems which, as Arnd described them, are "PCs with their x86 CPU removed and an ARM chip put in there"; adding ACPI support for those would be "harmless enough", he said. But Arnd seems to be strongly against adding ACPI support for complicated ARM-style systems.
Longer-term, the community is likely to watch and wait. Efforts will be made to direct the evolution of ACPI for ARM systems; Linaro, in particular, has developers engaged with that process now. And even Olof is open to bringing in ACPI support at some point in the future, once its supporters "seem to have their act together, have worked out their kinks and reached a usable stable platform". But that, he says, could be a couple of years from now.
Microsoft, through its dominance of the market for software on PC-class
systems, was able to push hardware standards in directions it liked. In
the ARM world, Linux dominates just as strongly, so it seems a bit
surprising to be playing catch-up with shifts in the ARM platform in this
way. Part of the problem, of course, is that there is no single Linux
voice at the standards table; companies like Linaro and Red Hat are working
on the problem, but they do not represent, or seemingly even talk to, the
rest of the community on this topic. The fact that much of this work is
done under
non-disclosure agreements does not help; NDAs do not fit well with how
community development is done.
In the end, it will certainly work out; it is hard to imagine any
significant class of ARM-based hardware being successful without solid
Linux support. It's mostly a matter of how much short- and long-term pain
will have to be endured to make that support happen. For all the early
complaining, ACPI has mostly worked out in the x86 world; it may well find
a useful role in the ARM market as well.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet