
Kernel development

Brief items

Kernel release status

The 3.9 merge window is open, so there is no current development kernel. See the separate article below for a summary of changes merged into the mainline for 3.9 so far.

Stable updates: 3.4.33 and 3.0.66 were released on February 21; they are single-patch updates fixing a security issue in the printk() code. 3.5.7.6 was released on February 22, and 3.7.10 (the final planned 3.7 update) was released on February 27.

As of this writing, the 3.8.1, 3.4.34, and 3.0.67 updates are in the review process; they can be expected on or after February 28.

Comments (2 posted)

Quotes of the week

Note that as of 5eaf563e53294d6696e651466697eb9d491f3946, you can now mount filesystems as an unprivileged user after a call to unshare(CLONE_NEWUSER | CLONE_NEWNS), or a similar clone(2) call. This means all those random random filesystem bugs you have laying around in the junk bin are now quite useful. ++tricks;
Jason A. Donenfeld

I suspect part of the problem is scale. Most people don't understand the scale at which the Linux Kernel and vendors handle bug fixes and code changes. External people simply see a few poorly handled security related issues and probably think "well how hard can it be to properly handle a few extra security flaws?" but they don't see that those 5 security issues were buried in 10,000 other code fixes. The resources needed to audit every code change for a security impact simply aren't available (and even if we had enough talented people who exactly is going to pay them all?).
Kurt Seifried

This naming alone would inhibit [BUG_ON()] use through two channels:

  • Putting the word 'CRASH' into your code feels risky, dissonant and wrong (perfect code does not crash) and thus needs conscious frontal lobe effort to justify it - while BUG_ON() really feels more like a harmless assert to most kernel developers, which is in our muscle memory through years of training.

  • CRASH_ON() takes one character more typing than WARN_ON(), and we know good kernel developers are fundamentally lazy.
Ingo Molnar

Comments (19 posted)

Kernel development news

3.9 Merge window, second episode

By Jonathan Corbet
February 27, 2013
As of this writing, just over 8,000 non-merge changesets have been pulled into the mainline for the 3.9 development cycle — 7,600 since last week's summary. Quite a few new features of interest have been merged for the 3.9 kernel; the most significant of those are listed below.

But first, a warning for development kernel testers: there are reports of ext4 filesystem corruption with current mainline kernels. The problem appears to have been identified and fixed, but it will remain as a permanent hazard for anybody running bisections over the 3.9 merge window. Development kernels have not often lived up to their fearsome reputation recently, but they can still bite at times.

  • The ARM architecture has gained support for the KVM virtualization mechanism on Cortex-A15 processors. Support for the ARM "power state coordination interface" has been added so that virtual CPUs can be powered up and down.

  • The socket filtering mechanism has a new SO_LOCK_FILTER option that prevents further changes to the filter. It is intended for privileged programs that install a filter before running untrusted code.

  • TCP and UDP sockets have a new option, SO_REUSEPORT, that allows multiple sockets to listen for new connections or incoming packets (respectively) on the same port at the same time. See this commit message for more information.

  • The netfilter connection-tracking code now supports "connection labels," which are bitmasks that can be attached to tracking entries and tested by netfilter rules.

  • The wireless networking subsystem has gained core support for the detection of radar systems operating on the networking frequencies; this is a necessary component for dynamic frequency selection in the 5GHz range.

  • VMware's "VM Sockets" subsystem, a mechanism for communication between virtual machines and a hypervisor, has been merged. Also merged is the "Virtual Machine Communication Interface" subsystem for high-speed communication between the host and guests.

  • The networking layer has support for the "Multiple VLAN Registration Protocol" (MVRP), which lets hosts communicate information about registered virtual networks to switches.

  • The block layer's handling of pages under writeback has been changed to address the performance penalty imposed by the previous "stable pages" work.

  • The PowerPC architecture supports a new set of transactional memory instructions; at this time, only user-space support is provided (the kernel does not use these instructions). See Documentation/powerpc/transactional_memory.txt for more information.

  • The Xen virtualization subsystem gained support for ACPI-based CPU and memory hotplugging, though, in both cases, only the "add" operation is supported currently.

  • The ext4 filesystem now supports hole punching in block-mapped files.

  • A long list of old network drivers has been deleted; these include the venerable 3c501, 3c505, and 3c507 drivers, various Intel i825xx drivers, parallel port-based drivers(!), and many more. It is expected that these drivers will not be missed, as many of them did not work all that well in the first place. As Paul Gortmaker put it: "You know things are not good when the Kconfig help text suggests you make a cron job doing a ping every minute." The long-unused "WAN router" subsystem has also been removed.

  • New hardware support includes:

    • Systems and processors: NVIDIA Tegra114 SoCs, the ARM "dummy virtual machine" (a minimal stub platform for virtualization uses), Prodrive PPA8548 AMC modules, and Tensilica Diamond 233L Standard core Rev.C processors.

    • Audio: NVIDIA Tegra20 AC97 interfaces.

    • Block: Renesas R-Car SATA controllers and Broadcom BCM2835 SD/MMC controllers.

    • Graphics: Marvell MMP display controllers, Samsung LMS501KF03 LCD panels, Himax HX-8357 LCD panels, Austrian Microsystems AS3711 backlight controllers, TI LCDC display controllers, and NXP Semiconductors TDA998X HDMI encoders.

    • Input: Steelseries SRW-S1 steering wheel devices.

    • Miscellaneous: STMicroelectronics ST33 I2C TPM devices, STMicroelectronics accelerometers, magnetometers, and gyroscopes, InvenSense ITG3200 digital 3-axis gyroscopes, Invensense MPU6050 gyroscope/accelerometer devices, NVIDIA Tegra20/30 SoC serial controllers, Comtrol RocketPort EXPRESS/INFINITY serial adapters, PCI-Express non-transparent bridges, Maxim MAX77686 and MAX8997 realtime clocks (RTCs), TI LP8788 RTCs, TI TPS80031/TPS80032 RTCs, Epson RX-4581 RTCs, ST-Ericsson Ux500 watchdogs, Intel Lynxpoint GPIO controllers, Atmel Timer Counter pulse-width modulators, TI/National LP5521 and LP5523/55231 LED controllers, Intel iSMT SMBus host controllers, and Broadcom BCM2835 I2C controllers.

    • Networking: 8devices USB2CAN interfaces and Inside Secure microread NFC interfaces.

    • USB: SMSC USB3503 USB 2.0 hub controllers.

    • Video4Linux: SuperH VEU mem2mem video processors, TI DM365 VPFE media controllers, Montage Technology TS2020-based tuners, Masterkit MA901 USB FM radios, OmniVision OV9650/OV9652 sensors, and Samsung S5C73M3 sensors.

    • Staging graduations: the Analog Devices ADXRS450/3 Digital Output Gyroscope SPI driver, Analog Devices ADIS16400 inertial sensor driver, Analog Devices ADIS16080/100 yaw rate gyroscope driver, Kionix KXSD9 accelerometer driver, TAOS TSL2560, TSL2561, TSL2562 and TSL2563 ambient light sensor driver, and OMAP direct rendering driver have been moved out of the staging tree and into the mainline kernel.

Changes visible to kernel developers include:

  • The netpoll mechanism now supports IPv6, allowing network consoles to be run over IPv6 networks.

  • Most drivers no longer depend on the EXPERIMENTAL configuration option. So much code needed that option that it is turned on almost universally, with the result that it does not actually mean anything. So now it defaults to "yes," and it will soon be removed entirely.

  • The sound layer has a generic parser for Intel high definition audio (HDA) codecs. Many drivers have been converted to use this parser, resulting in the removal of a great deal of duplicated code.

  • The __get_user_8() function is now available on 32-bit x86 systems; it will fetch a 64-bit quantity from user space.

  • The module signing code has a few usability enhancements. The sign-file utility has new options to specify which hash algorithm to use or to simply provide the entire signature (which will have been computed elsewhere). There is also a new MODULE_SIG_ALL configuration option that controls whether modules are automatically signed at modules_install time.

  • The descriptor-based GPIO patch set has been merged, with significant changes to how GPIO lines are handled within the kernel.

  • The new file_inode() helper should be used instead of the traditional file->f_dentry->d_inode pointer chain.

The merge window should stay open through approximately March 5, though, one assumes, the rate of change will drop off somewhat toward the end. Next week's edition will summarize the changes that go in for the final part of the 3.9 merge window.

Comments (7 posted)

ELC: In-kernel switcher for big.LITTLE

By Jake Edge
February 27, 2013

The ARM big.LITTLE architecture has been the subject of a number of LWN articles (here's another) and conference talks, as well as a fair amount of code. A number of upcoming systems-on-chip (SoCs) will be using the architecture, so some kind of near-term solution for Linux support is needed. Linaro's Mathieu Poirier came to the 2013 Embedded Linux Conference to describe that interim solution: the in-kernel switcher.

Two kinds of CPUs

Big.LITTLE incorporates architecturally similar CPUs that have different power and performance characteristics. The similarity must include a one-to-one mapping between the instruction sets of the two CPU types, so that code can "migrate seamlessly", Poirier said. Identical CPUs are grouped into clusters.

[Mathieu Poirier]

The SoC he has been using for testing consists of three Cortex-A7 CPUs (LITTLE: less performance, less power consumption) in one cluster and two Cortex-A15s (big) in the other. The SoC was deliberately chosen to have a different number of processors in the clusters as a kind of worst case to catch any problems that might arise from the asymmetry. Normally, one would want the same number of processors in each cluster, he said.

The clusters are connected with a cache-coherent interconnect, which can snoop the cache to keep it coherent between clusters. There is an interrupt controller on the SoC that can route any interrupt from or to any CPU. In addition, there is support in the SoC for I/O coherency that can be used to keep GPUs or other external processors cache-coherent, but that isn't needed for Linaro's tests.

The idea behind big.LITTLE is to provide a balance between power consumption and performance. The first idea was to run CPU-hungry tasks on the A15s, and less hungry tasks on the A7s. Unfortunately, it is "hard to predict the future", Poirier said, which made it difficult to make the right decisions because there is no way to know what tasks are CPU intensive ahead of time.

Two big.LITTLE approaches

That led Linaro to a two-pronged approach to solving the problem: Heterogeneous Multi-Processing (HMP) and the In-Kernel Switcher (IKS). The two projects are running in parallel and are both in the same kernel tree. Not only that, but you can enable either on the kernel command line or switch at run time via sysfs.

With HMP, all of the cores in the SoC can be used at the same time, but the scheduler needs to be aware of the capabilities of the different processors to make its decisions. It will lead to higher peak performance for some workloads, Poirier said. HMP is being developed in the open, and anyone can participate, which means it will take somewhat longer before it is ready, he said.

IKS is meant to provide a "solution for now", he said, one that can be used to build products with. The basic idea is that one A7 and one A15 are coupled into a single virtual CPU. Each virtual CPU in the system will then have the same capabilities, thus isolating the core kernel from the asymmetry of big.LITTLE. That means much less code needs to change.

Only one of the two processors in a virtual CPU is active at any given time, so the decision on which of the two to use can be made at the CPU frequency (cpufreq) driver level. IKS was released to Linaro members in December 2012, and is "providing pretty good results", Poirier said.

An alternate way to group the processors would be to put all the A15s together and all the A7s into another group. That turned out to be too coarse as it was "all or nothing" in terms of power and performance. There was also a longer synchronization period needed when switching between those groups. Instead, it made more sense to integrate "vertically", pairing A7s with A15s.

For the test SoC, the "extra" A7 was powered off, leaving two virtual CPUs to use. The processors are numbered (A15_0, A15_1, A7_0, A7_1) and then paired up (i.e. {A15_0, A7_0}) into virtual CPUs; "it's not rocket science", Poirier said. One processor in each group is turned off, but only the cpufreq driver and the switching logic need to know that there are more physical processors than virtual processors.

The virtual CPU presents a list of operating frequencies that encompasses the range of frequencies at which both the A7 and the A15 can operate. While the numbers look like frequencies (ranging from 175MHz to 1200MHz in the example he gave), they don't really need to be, as they are essentially just indexes into a table in the cpufreq driver. The driver maps those values to a real operating point for one of the two processors.

Switching CPUs

The cpufreq core is not aware of the big.LITTLE architecture, so the driver does a good bit of work, Poirier said, but the code for making the switching decision is simple. If the requested frequency can't be supported by the current processor, switch to the other. That part is eight lines of code, he said.

For example, if virtual CPU 0 is running on the A7 at 200MHz and a request comes in to go to 1.2GHz, the driver recognizes that the A7 cannot support that. In that case, it decides to power down the A7 (which is called the outbound processor) and power up the A15 (inbound). There is a synchronization process that happens as part of the transition so that the inbound processor can use the existing cache. That process is described in Poirier's slides [PDF], starting at slide 17.

The outbound processor powers up the inbound and continues executing normal kernel/user-space code until it receives the "inbound alive" signal. After sending that signal, the inbound processor initializes both the cluster and the interconnect if it is the first in its cluster (i.e. the other processor of the same type, in the other virtual CPU, is powered down). It then waits for a signal from the outbound processor.

Once the outbound processor receives the "inbound alive" signal, the blackout period (i.e. the time when no kernel or user code is running on the virtual CPU) begins. The outbound processor disables interrupts, migrates the interrupt signals to the inbound processor, then saves the current CPU context. Once that's done, it signals the inbound processor, which restores the context, enables interrupts, and continues executing from where the outbound processor left off. All of that is possible because the instruction sets of the two processors are identical.

As part of its cleanup, the outbound processor creates a new stack for itself so that it won't interfere with the inbound. It then flushes the local cache and checks to see if it is the last one standing in its cluster; if so, it flushes the cluster cache and disables the cache-coherent interconnect. It then powers itself off.

There are some pieces missing from the picture that he painted, Poirier said, including "vlocks" and other mutual-exclusion mechanisms that coordinate simultaneous requests for cluster power-state changes. Also missing was discussion of the "early poke" mechanism, as well as the code needed to track the CPU and cluster states.

Performance

One of Linaro's main targets is Android, so it used the interactive power governor for its testing. Any governor will work, he said, but will need to be tweaked. A second threshold (hispeed_freq2) was added to the interactive governor to keep the A15 from entering its "very power hungry" overdrive states too quickly.

For testing, BBench was used. It gives a performance score based on how fast web pages are loaded. That was run with audio playing in the background. The goal was to get 90% of the performance of two A15s, while using 60% of the power, which was achieved. Different governor parameters gave 95% performance with 65% of the power consumption.

It is important to note that tuning is definitely required—without it you can do worse than the performance of two A7s. "If you don't tune, all efforts are wasted", Poirier said. The interactive governor has 15-20 variables, but Linaro mainly concentrated on hispeed_load and hispeed_freq (and the corresponding *2 parameters added for handling overdrive). The basic configuration had the virtual CPU run on the A7 until the load reached 85%, when it would switch to the first six (i.e. non-overdrive) frequencies on the A15. After 95% load, it would use the two overdrive frequencies.

The upstreaming process has started, with the cluster power management code getting "positive remarks" on the ARM Linux mailing list. The goal is to upstream the code entirely, though some parts of it are only available to Linaro members at the moment. The missing source will be made public once a member ships a product using IKS. But, IKS is "just a stepping stone", Poirier said, and "HMP will blow this out of the water". It may take a while before HMP is ready, though, so IKS will be available in the meantime.

[ I would like to thank the Linux Foundation for travel assistance to attend ELC. ]

Comments (1 posted)

Loading keys from Microsoft PE binaries

By Jonathan Corbet
February 27, 2013
The kernel does not run programs in Microsoft's Portable Executable (PE) format. So when a patch came along adding support for those binaries — not to run programs, but to use them as a container for trusted keys — the reaction was not entirely positive. In truth, the reaction was sufficiently negative to be widely quoted across the net. When one looks beyond the foul language, though, there are some fundamental questions about how Linux should support the UEFI secure boot mechanism and how much the kernel community needs to be concerned about Microsoft's wishes in this area.

The work done at Red Hat, SUSE, the Linux Foundation, and elsewhere is sufficient to enable a distributor to ship a binary distribution that will boot on a secure-boot-enabled system. Such distributions are often built so that they will only load kernel modules that have been signed by a trusted key, normally the distributor's own key. That restriction naturally causes problems for companies that ship binary-only modules; such modules will not be loadable into a secure-boot system. Many developers in the kernel community are not overly concerned about this difficulty; many of them, being hostile to the idea of binary-only modules in the first place, think this situation is just fine. Distributors like Red Hat, though, are not so sanguine.

One solution, of course, would be for those distributors to just sign the relevant binary modules directly. As Matthew Garrett points out, though, there are a number of practical difficulties with this approach, including the surprisingly difficult task of verifying the identity and trustworthiness of the company shipping the module. There's also the little problem that signing binary-only modules might make Red Hat look bad in various parts of our community and give strength to those claiming that such modules have no GPL compliance problems. So Red Hat would like to find a way to enable proprietary modules to be loaded without touching them directly, allowing the company to pretend not to be involved in the whole thing.

Red Hat's solution is to convince the kernel to trust any signing key that has been signed by Microsoft. Binary module vendors could then go to Microsoft to get their own key signed and present it to the kernel as being trustworthy; the kernel would then agree to load modules signed with this key. This only works, of course, if the kernel already trusts Microsoft's key, but that will be the case for all of the secure boot solutions that exist thus far. There is one other little problem in that the only thing Microsoft will sign is a PE binary. So Red Hat's scheme requires that the vendor's key be packaged into a PE binary for Microsoft to sign. Then the kernel will read the binary file, verify Microsoft's signature, extract the new key, and add that key to the ring of keys it trusts. Once that is done, the kernel will happily load modules signed by the new key.

This solution seems almost certain not to find its way into the mainline kernel. In retrospect, it is unsurprising that a significant patch that is seen as simultaneously catering to the wishes of Microsoft and binary module vendors would run into a bit of resistance. That is even more true when there appear to be reasonable alternatives, such as either (1) having Red Hat sign the modules directly, or (2) having Red Hat sign the vendor keys with its own key. Such solutions are unpopular because, as mentioned above, they reduce Red Hat's plausible deniability; they also make revocation harder and almost certainly require vendors to get a separate signature for each distribution they wish to support.

Linus has made it clear that he is not worried about those problems, though. Security, he says, should be in the control of the users; it should not be a mechanism used to strengthen a big company's control. So, rather than wiring Microsoft's approval further into the kernel, he would rather see distributors encourage approaches that educate users, improve their control, and, he says, would ultimately be more secure. Loading a module in this environment, he said, would be a matter of getting the user to verify that the module is wanted rather than verifying a signing key.

The other reason that this patch is running into resistance is that there is widespread skepticism of the claim that the loading of unsigned modules must be blocked in the first place. Proponents claim that module signing (along with a whole set of other restrictions) is needed to prevent Linux from being used as a way to circumvent the secure boot mechanism and run compromised versions of Windows. Microsoft, it is said, will happily blacklist the Linux bootloader if Linux systems are seen as being a threat to Windows systems. Rather than run that risk, Linux, while running under secure boot, must prevent the running of arbitrary kernel code in any way. That includes blocking the loading of unsigned kernel modules.

It seems that not all kernel developers are worried about this possibility. Greg Kroah-Hartman asserted that module signature verification is not mandated by UEFI. Ted Ts'o added that Microsoft would suffer public relations damage and find itself under antitrust scrutiny if it were to act to block Linux from booting. It also seems unlikely to some that an attacker could rig a system to boot Linux, load a corrupted module, then chain-boot into a corrupted Windows system without the user noticing. For all of these reasons, a number of developers seem to feel that this is a place where the kernel community should maybe push back rather than letting Microsoft dictate the terms under which a system can boot on UEFI hardware. But some of Red Hat's developers, in particular, seem to be genuinely afraid of the prospect of a key revocation by Microsoft; Dave Airlie put it this way:

Its a simple argument, MS can revoke our keys for whatever reason, reducing the surface area of reasons for them to do so seems like a good idea. Unless someone can read the mind of the MS guy that arbitrarily decides this in 5 years time, or has some sort of signed agreement, I tend towards protecting the users from having their Linux not work anymore...

Others counter that, if Microsoft can revoke keys for any reason, there is little to be done to protect the kernel in any case.

In the end, this does not appear to be an easy disagreement to resolve, though some parts are easy enough: Linus has refused to accept the key-loading patch, so it will not be merged. What may well happen is that the patch will drop out of sight, but that distributors like Red Hat will quietly include it in their kernels. That will keep this particular disagreement from returning to the kernel development list, but it does little to resolve the larger question of how much Linux developers should be driven by fear of Microsoft's power as they work to support the UEFI secure boot mechanism.

Comments (42 posted)

Patches and updates

Kernel trees

Build system

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds