Kernel development
Brief items
Kernel release status
The current development kernel is 3.13-rc1, released on November 22. In the end, 10,518 non-merge changesets were pulled into the mainline during this merge window. Now the stabilization period starts, with the final 3.13 release due around the end of the year.

Stable updates: no stable updates have been released in the last week. 3.12.2, 3.11.10, 3.10.21, and 3.4.71 are in the review process as of this writing; they can be expected sometime on or after November 28. Note that 3.11.10 is expected to be the final update to the 3.11 kernel.
Quotes of the week
Unfortunately it is impossible to know at this point what work is actually relevant for SPECTRE and what is not, so we can't really merge anything specific to ARM64+ACPI until we have access to an actual spec, or we get a video message by someone with a monocle and a lap cat to shed some more light on the actual requirements.
Checkpoint/restore tool v1.0
After years of work, version 1.0 of the checkpoint/restore tool is available. This is a mostly user-space-based tool that is able to capture the state of a set of processes to persistent storage and restore it at some future time, possibly on a different system. See this 2013 Kernel Summit article for details on the current state of this functionality.

Facebook likes Btrfs
Two Btrfs developers — Chris Mason and Josef Bacik — have simultaneously announced their departure from Fusion IO to work for Facebook instead. Chris says: "From a Btrfs point of view, very little will change. All of my Btrfs contributions will remain open and I'll continue to do all of my development upstream." Josef adds: "Facebook is committed to the success of Btrfs so not much will change as far as my involvement with the project, I will still be maintaining btrfs-next and working on stability."
Kernel development news
The conclusion of the 3.13 merge window
Linus released 3.13-rc1 and closed the 3.13 merge window on November 22, perhaps a couple of days earlier than some developers might have expected. Counting a couple of post-rc1 straggler pulls, some 10,600 non-merge changesets were pulled into the mainline for this development cycle; that is roughly 700 more than at the time of last week's summary.

As might be expected, the list of user-visible features included in that relatively small set of patches is short; it includes:
- The squashfs filesystem now has multi-threaded decompression; it can
also decompress directly into the page cache, eliminating the
temporary buffer used previously.
- There have been several changes to the kernel's key-storage subsystem.
The maximum number of keys has increased to an essentially unlimited
value, allowing, for example, the NFS code to store vast numbers of
user ID mapping values as keys. There is a new concept of a "trusted"
key, being one obtained from the hardware or otherwise validated, and
keyrings can be marked as allowing only trusted keys. Finally, a
mechanism for persistent keys not attached to a given user ID has been
added, and key data can be quite large; both of these changes were
needed to enable Kerberos to use the key subsystem. (A brief user-space example of the key-management interface appears after this list.)
- New hardware support includes:
  - Input: Samsung SUR40 touchscreens.
  - Security: Nuvoton NPCT501 I2C trusted platform modules, Atmel AT97SC3204T I2C trusted platform modules, OMAP34xx random number generators, Qualcomm MSM random number generators, and Freescale cryptographic accelerators (job ring support).
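Returning to the key-subsystem work mentioned above: those changes build on the long-standing user-space key-management interface. As a minimal illustration of that interface (a sketch only; the "user" key type, description, and payload below are arbitrary example values), a program can add a key to its session keyring and read it back with the keyutils library:

    /* Minimal sketch of the key-management interface via libkeyutils.
     * Build with: gcc keydemo.c -o keydemo -lkeyutils
     * The description and payload are arbitrary example values. */
    #include <keyutils.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            const char payload[] = "example-secret";
            key_serial_t key;
            char buf[64];
            long len;

            /* Add a "user" type key to the session keyring */
            key = add_key("user", "example:demo", payload, strlen(payload),
                          KEY_SPEC_SESSION_KEYRING);
            if (key == -1) {
                    perror("add_key");
                    return EXIT_FAILURE;
            }

            /* Read the payload back out of the kernel */
            len = keyctl_read(key, buf, sizeof(buf));
            if (len < 0) {
                    perror("keyctl_read");
                    return EXIT_FAILURE;
            }
            printf("key %d holds %ld bytes\n", key, len);
            return EXIT_SUCCESS;
    }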
Changes visible to kernel developers include:
- There is a new associative array data structure in the kernel. It was
added to support the keyring work, but could be applicable in other
situations as well. See Documentation/assoc_array.txt for
details.
- The information in struct page is now even more dense with the addition of Joonsoo Kim's patch set to have the slab allocator store more information there. See this article for details.
Now the final stabilization phase for all of this work begins. Your editor predicts that the final 3.13 kernel will be released sometime between the New Year and the beginning of linux.conf.au 2014 on January 6.
The tick broadcast framework
Power management is an increasingly important responsibility of almost every subsystem in the Linux kernel. One of the most established power-management mechanisms in the kernel is the cpuidle framework, which puts idle CPUs into sleeping states until they have work to do. These sleeping states are called the "C-states" or CPU operating states; the deeper a C-state, the more power is conserved.

However, an interesting problem surfaces when CPUs enter certain deep C-states. Idle CPUs are typically woken up by their respective local timers when there is work to be done, but what happens if these CPUs enter deep C-states in which those timers stop working? Who will wake up the CPUs in time to handle the work scheduled on them? This is where the "tick broadcast framework" steps in. It assigns a clock device that is not affected by the C-states of the CPUs as the timer responsible for handling the wakeup of all those CPUs that enter deep C-states.
Overview of the tick broadcast framework
In the case of an idle or a semi-idle system, there could be more than one CPU entering a deep idle state where the local timer stops. These CPUs may have different wakeup times. How is it possible to keep track of when to wake up the CPUs, considering a timer is merely a clock device that cannot keep track of more information than the time at which it is supposed to interrupt? The tick broadcast framework in the kernel provides the necessary infrastructure to handle the wakeup of such CPUs at the right time.
Before looking into the tick broadcast framework, it is important to understand how the CPUs themselves keep track locally of when their respective pending events need to be run.
The kernel keeps track of the time at which a deferred task needs to be run based on the concept of timeouts. The timeouts are implemented using clock devices called timers which have the capacity to raise an interrupt at a specified time. In the kernel, such devices are called the "clock event" devices. Each CPU is equipped with a local clock event device that is programmed to interrupt at the time of the next-to-run deferred task on that CPU, so that said task can be scheduled on the CPU. These local clock event devices can also be programmed to fire periodically to do regular housekeeping jobs like updating the jiffies value, checking if a task has to be scheduled out, etc. These timers are therefore called the "tick devices" in the kernel and are represented by struct tick_device.
A per-CPU tick_device representing the local timer is declared using the variable tick_cpu_device. Every CPU keeps track of when its local timer needs to interrupt it next in its copy of tick_cpu_device as next_event and programs the local timer with this value. To be more precise, the value can be found in tick_cpu_device->evtdev->next_event, where evtdev is an instance of the clock event device mentioned above.
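For reference, the pieces described above look roughly like this (an abridged and lightly paraphrased view of the 3.13-era definitions in include/linux/clockchips.h, include/linux/tick.h, and kernel/time/tick-common.c; many fields are omitted):

    /* Abridged view of the structures involved; many fields omitted. */
    struct clock_event_device {
            void            (*event_handler)(struct clock_event_device *);
            int             (*set_next_event)(unsigned long evt,
                                              struct clock_event_device *);
            void            (*broadcast)(const struct cpumask *mask);
            ktime_t         next_event;     /* next programmed expiry */
            unsigned int    features;       /* e.g. CLOCK_EVT_FEAT_C3STOP */
            /* ... */
    };

    struct tick_device {
            struct clock_event_device       *evtdev;
            enum tick_device_mode           mode;
    };

    /* Each CPU has a tick device wrapping its local timer: */
    DEFINE_PER_CPU(struct tick_device, tick_cpu_device);

    /* So the next local-timer expiry for a given CPU is found in
     * per_cpu(tick_cpu_device, cpu).evtdev->next_event. */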
The external clock device that is required to stand in for the local timers in some deep idle states is just another tick device, but is not normally required to keep track of events for specific CPUs. This device is represented by tick_broadcast_device (defined in kernel/time/tick-broadcast.c), in contrast to tick_cpu_device.
Registering a timer as the tick_broadcast_device
During the initialization of the kernel, every timer in the system registers itself as a tick_device. These timers carry flags that describe their properties; the one of special interest here is CLOCK_EVT_FEAT_C3STOP, which indicates that the timer stops in the C3 idle state. Although the C3 idle state is specific to the x86 architecture, this flag is used generically to convey that the timer stops in one of the deep idle states.
Any timers which do not have the flag CLOCK_EVT_FEAT_C3STOP set are potential candidates for tick_broadcast_device. Since all local timers have this flag set on architectures where they stop in deep idle states, all of them become ineligible for this role. On architectures like x86, there is an external device called the HPET — High Precision Event Timer — which becomes a suitable candidate. Since the HPET is placed external to the processor, the idle power management for a CPU does not affect it. Naturally it does not have the CLOCK_EVT_FEAT_C3STOP flag set among its properties and becomes the choice for tick_broadcast_device.
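In code form, the eligibility test boils down to a check of that flag. The following is a simplified sketch, not the actual kernel function (which lives in kernel/time/tick-broadcast.c and checks several other conditions as well):

    /* Simplified sketch: a clock event device can serve as the broadcast
     * device only if it keeps running in deep idle states, i.e. it does
     * not carry the CLOCK_EVT_FEAT_C3STOP flag. */
    static bool can_be_broadcast_device(struct clock_event_device *dev)
    {
            if (dev->features & CLOCK_EVT_FEAT_C3STOP)
                    return false;   /* stops in deep C-states: ineligible */
            return true;            /* e.g. the HPET on x86 */
    }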
Tracking the CPUs in deep idle states
Now we'll return to the way the tick broadcast framework keeps track of when to wake up the CPUs that enter idle states when their local timers stop. Just before a CPU enters such an idle state, it calls into the tick broadcast framework. This CPU is then added to a list of CPUs to be woken up; specifically, a bit is set for this CPU in a "broadcast mask".
Then a check is made to see whether this CPU needs to be woken up before the time for which the tick_broadcast_device is currently programmed. If so, the time at which the tick_broadcast_device should interrupt is updated to reflect the new value, and this value is programmed into the external timer. The tick_cpu_device of the CPU that is entering the deep idle state is now put into CLOCK_EVT_MODE_SHUTDOWN mode, meaning that it is no longer functional.
Each time a CPU goes into a deep idle state, the above steps are repeated and the tick_broadcast_device is programmed to fire at the earliest of the wakeup times of the CPUs in deep idle states.
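In the 3.13 kernel, an idle driver asks for this service through clockevents_notify(). The following sketch shows, in simplified form, how a deep-idle entry path can be bracketed by the broadcast enter/exit notifications (modeled loosely on what drivers such as intel_idle do; go_to_sleep() is a stand-in for the architecture-specific code that actually enters the C-state, and error handling is omitted):

    /* Simplified sketch of handing a CPU over to the tick broadcast
     * framework around a timer-stopping C-state. */
    static void enter_deep_idle(int cpu)
    {
            /* The local timer will stop: let the broadcast device cover
             * this CPU.  This sets the CPU's bit in the broadcast mask
             * and, if needed, reprograms the broadcast device. */
            clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);

            go_to_sleep();  /* stand-in for the real C-state entry */

            /* Back from the broadcast IPI (or another wakeup): clear the
             * CPU's bit in the mask and revive the local tick device. */
            clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
    }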
Waking up the CPUs in deep idle states
When the external timer expires, it interrupts one of the online CPUs, which scans the list of CPUs that have asked to be woken up to check if any of their wakeup times have been reached. That means the current time is compared to the tick_cpu_device->evtdev->next_event of each CPU. All those CPUs for which this is true are added to a temporary mask (different from the broadcast mask), and the tick_broadcast_device's next expiry time is set to the earliest of the wakeup times of the CPUs that remain in deep idle states. What remains to be seen is how the CPUs in the temporary mask are woken up.
Every tick device has a "broadcast method" associated with it. This method is an architecture-specific function encapsulating the way inter-processor interrupts (IPIs) are sent to a group of CPUs. Similarly, each local timer is also associated with this method. The broadcast method of the local timer of one of the CPUs in the temporary mask is invoked by passing it the same mask. IPIs are then sent to all the CPUs that are present in this mask. Since wakeup interrupts are sent to a group of CPUs, this framework is called the "broadcast" framework. The broadcast is done in tick_do_broadcast() in kernel/time/tick-broadcast.c.
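A simplified sketch of that expiry path follows; it is not the actual kernel code, and broadcast_mask and program_broadcast_device() are stand-ins for the real mask and reprogramming logic in kernel/time/tick-broadcast.c:

    /* Simplified sketch of the broadcast-expiry path. */
    static void broadcast_timer_expired(ktime_t now, struct cpumask *broadcast_mask)
    {
            struct cpumask wakeup_mask;              /* the temporary mask */
            ktime_t next = { .tv64 = KTIME_MAX };    /* earliest remaining wakeup */
            int cpu, this_cpu = smp_processor_id();

            cpumask_clear(&wakeup_mask);
            for_each_cpu(cpu, broadcast_mask) {
                    struct clock_event_device *evt =
                            per_cpu(tick_cpu_device, cpu).evtdev;

                    if (evt->next_event.tv64 <= now.tv64)
                            cpumask_set_cpu(cpu, &wakeup_mask);  /* due: wake it */
                    else if (evt->next_event.tv64 < next.tv64)
                            next = evt->next_event;
            }

            /* One IPI wakes every CPU whose time has come; the handler on
             * those CPUs behaves like a local timer interrupt. */
            if (!cpumask_empty(&wakeup_mask))
                    per_cpu(tick_cpu_device, this_cpu).evtdev->broadcast(&wakeup_mask);

            /* Re-arm the broadcast device for the CPUs still asleep. */
            if (next.tv64 != KTIME_MAX)
                    program_broadcast_device(next);
    }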
The IPI handler for this specific interrupt needs to be that of the local timer interrupt itself, so that the CPUs in deep idle states wake up as if they had been interrupted by their own local timers. The effect of their local timers stopping on entry to the idle state is hidden from them; they should see the same state before and after wakeup and continue running as if nothing had happened.
While handling the IPI, the CPUs call into the tick broadcast framework so that they can be removed from the broadcast mask, since it is known that they have received the IPI and have woken up. Their respective tick devices are brought out of the CLOCK_EVT_MODE_SHUTDOWN mode, indicating that they are back to being functional.
Conclusion
As can be seen from the above discussion, enabling deep idle states causes the kernel to do additional work. One might therefore naturally wonder whether it is worth the trouble, since it could hamper performance in the process of saving power.
Idle CPUs enter deep C-states only if they are predicted to remain idle for a long time, on the order of milliseconds. Broadcast IPIs should therefore be well spaced in time and not frequent enough to affect the performance of the system. The tick broadcast framework could be optimized further by aligning the wakeup time of the idle CPUs to a periodic tick boundary whose interval is on the order of a few milliseconds, so that CPUs going idle at almost the same time choose the same wakeup time. By looking at more such ways to minimize the number of broadcast IPIs sent, we could ensure that the overhead involved is insignificant compared to the large power savings that the deep idle states yield. If this can be achieved, it is a good enough reason to enable and optimize an infrastructure for the use of deep idle states.
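As a concrete illustration of that alignment idea (hypothetical code, not something present in the kernel), a helper could round each requested wakeup up to the next boundary of a common period, so that CPUs going idle at nearly the same time share one broadcast interrupt; the cost is that a wakeup may fire slightly later than requested, which deep-idle wakeups can usually tolerate:

    /* Hypothetical illustration: round a requested wakeup time up to the
     * next boundary of a common period, so that CPUs with nearby
     * deadlines share a single broadcast interrupt. */
    #define WAKEUP_ALIGN_NS (4 * NSEC_PER_MSEC)     /* example: 4 ms grid */

    static ktime_t align_wakeup(ktime_t expires)
    {
            u64 ns = ktime_to_ns(expires);

            ns = roundup(ns, WAKEUP_ALIGN_NS);      /* next boundary */
            return ns_to_ktime(ns);
    }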
Acknowledgments
I would like to thank my technical mentor Vaidyanathan Srinivasan for having patiently reviewed the initial drafts, my manager Tarundeep Singh, and my teammates Srivatsa S. Bhat and Deepthi Dharwar for their guidance and encouragement during the drafting of this article.
Many thanks to IBM Linux Technology Center and LWN for having provided this opportunity.
ACPI for ARM?
The "Advanced Configuration and Power Interface" (ACPI) was not an obvious win when support for it was first merged into the mainline kernel. The standard was new, actual implementations were unreliable, and supporting it involved bringing a large virtual machine into the kernel. For years, booting with ACPI disabled was the first response to a wide range of problems; one can still find web sites advising readers to do that. But, for the most part, ACPI has settled in as a mandatory part of the PC platform standard. Now, however, it appears that a similar story may be about to play out in the ARM world.
Arguments for and against ACPI
There have been rumblings for a few years that ACPI would start to appear in ARM-based systems, and in server systems in particular. Recently, some code to support such systems has started to make the rounds; Olof Johansson, a co-maintainer of the arm-soc tree, looked at this code and didn't like what he saw:
In this message and several followups Olof clarified what he was trying to get across. The ARM world already has a mechanism to describe the hardware — device trees — that is only now coming into focus. Adding device tree support has required making changes to a large amount of platform and driver code; supporting ACPI threatens to bring just as much work and add a second code path for system configuration that will need to be maintained forever. Even worse is the fact that there are no established standards for ACPI in the ARM setting; nobody really knows how things are supposed to work, and what is coming out in the early stages is not encouraging. Bringing in ARM ACPI support now would be committing the kernel community to supporting a moving target indefinitely.
Olof went on to suggest that it might be best to wait for others to figure out how ACPI on ARM is supposed to work:
He added that, until there are ACPI systems shipping with Windows and working well, the Linux community should stay far away from ACPI on ARM. If ACPI-based systems actually hit the market, he said, they can be supported with a pre-boot layer that translates the system's ACPI tables into the device tree format.
Disagreement with this position came in a couple of forms. Several people pointed out that standards developed by Microsoft may not suit the Linux community as well as we might like. As Mark Rutland (a device tree bindings maintainer) put it:
Russell King added another point echoed by many: refusing to support ACPI could cost the community its chance to influence (or even control) how the standard evolves. In his words:
Shutting the door on ACPI, Russell asserted, would be a move that the community would regret in the long term.
Jon Masters joined the conversation to make
the claim that
ARM-based servers were committed to the ACPI path, saying "all of the
big boys are going to be using ACPI whether it's liked much or
not
". He said that the server space requires a mechanism that has
been standardized and set in stone, and that, in his opinion, the device
tree abstraction is far too unstable to be usable (a claim that Grant Likely strongly disagreed with). Red Hat, Jon said, is fully behind ACPI on ARM servers for
all of the products that it has absolutely not said it will ever offer.
Jon's wording, along with his suggestion that everything has already been
decided in NDA-protected conference rooms, won him few friends in this
discussion, but his point
remains: there will be systems using ACPI on the market, and Linux has to
deal with them somehow.
What to do
But that still doesn't answer the question of how to deal with them. Arnd Bergmann suggested that ACPI might not be a long-term issue for the ARM community:
Most people, though, seemed to think that ACPI could be here to stay, so the community will have to figure out some way of dealing with it.
One possibility might be Olof's idea of translating the ACPI tables into a device tree, but that approach was somewhat unpopular. To many, it looks like a partial solution that would run into no end of difficulties; there is also the matter of running the ACPI Machine Language (AML) code found in the ACPI firmware. AML can be necessary for hardware initialization and power-management tasks, but it has no analog in the device tree world. Generally, there was a sentiment that, if ACPI is to be supported on ARM systems, it should be supported properly and not hidden behind some sort of translation layer.
In the short term, some sort of translation to device trees — either at
boot-time or done by hand — seems likely to
be the outcome, though. Putting code into the kernel to support any
ACPI-based systems that might appear in the near future just seems to many
like a way to take on a long-term support burden for short-lived systems.
What might start to tip the balance could be systems which, as Arnd described them, are "PCs with their x86 CPU removed and an ARM chip put in there"; adding ACPI support for those would be "harmless enough", he said. But Arnd seems to be strongly against adding ACPI support for complicated ARM-style systems.
Longer-term, the community is likely to watch and wait. Efforts will be made to direct the evolution of ACPI for ARM systems; Linaro, in particular, has developers engaged with that process now. And even Olof is open to bringing in ACPI support at some point in the future, once its supporters "seem to have their act together, have worked out their kinks and reached a usable stable platform". But that, he says, could be a couple of years from now.
Microsoft, through its dominance of the market for software on PC-class
systems, was able to push hardware standards in directions it liked. In
the ARM world, Linux dominates just as strongly, so it seems a bit
surprising to be playing catch-up with shifts in the ARM platform in this
way. Part of the problem, of course, is that there is no single Linux
voice at the standards table; companies like Linaro and Red Hat are working
on the problem, but they do not represent, or seemingly even talk to, the
rest of the community on this topic. The fact that much of this work is
done under
non-disclosure agreements does not help; NDAs do not fit well with how
community development is done.
In the end, it will certainly work out; it is hard to imagine any
significant class of ARM-based hardware being successful without solid
Linux support. It's mostly a matter of how much short- and long-term pain
will have to be endured to make that support happen. For all the early
complaining, ACPI has mostly worked out in the x86 world; it may well find
a useful role in the ARM market as well.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet