Kernel development [LWN.net]

Kernel release status

The 3.15 merge window remains open, so there is no development kernel as of this writing. See the article below for a summary of what has been merged thus far.

Stable updates: 3.13.9, 3.10.36, and 3.4.86 were released on April 3; 3.12.17 came out on April 8.

Quotes of the week

We haven't quite made it for 2.5, but maybe we can merge this for 3.15.

— Arnd Bergmann finally gets rid of sleep_on()

It's been a long time since "all the world's an i386" --- these days, it's "all the world's an x86_64".

— Ted Ts'o

For example, the kernel uses debug ("dee-bug") to mean log everything to the console, where systemd uses the debug from the Scandinavian "day-boog" meaning "fail to boot". If a future versions uses argv[] instead of reading /proc/cmdline, this confusion will be avoided.

— Rusty Russell

3.15 Merge window, part 1

By Jonathan Corbet
April 9, 2014

As of this writing, 11,321 non-merge changesets have been pulled into the mainline repository during the 3.15 merge window. That makes 3.15 one of the busiest development cycles ever, though it has not yet surpassed 3.10, which saw 11,963 changes pulled during the merge window. Despite all these changes, the list of new features is not as big or as impressive as one might expect; much of the work being merged is under-the-covers cleanup and restructuring.

That said, there is still a lot of interesting stuff brewing for 3.15. User-visible changes in this release will include:

The latency tolerance patches have been added to the power-management quality-of-service subsystem. This code allows the kernel (or user space) to communicate latency requirements to peripheral devices, which should use that information to avoid going into overly deep sleep states.
The active/inactive list balancing patch set has found its way into the memory management subsystem at last. This work tries to detect situations where the kernel is pushing pages out of memory, only to fault them back in shortly thereafter; when that happens, the sizes of the LRU lists are adjusted in an attempt to improve the situation. The result should be improved performance for workloads with large working sets.
The new renameat2() system call adds the ability to atomically exchange two files. There is also a RENAME_NOREPLACE flag that prevents a rename operation from replacing an existing file.
The file-private POSIX locks feature has been merged.
The FUSE (filesystems in user space) subsystem can now perform writeback caching, improving performance on write-heavy workloads.
The UBI flash translation layer has gained a driver that can make a flash device appear to be a (read-only, for now) block device. That enables the use of any filesystem on top of a raw flash device.
The ext4 and XFS filesystems now support the new FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE operations. XFS also has added support for the O_TMPFILE flag.
The device mapper has a new "dm-era" target that can maintain a list of blocks changed during a user-defined period of time. See Documentation/device-mapper/era.txt for more information.
The device tree information found in /proc/device-tree has been removed. That same information is available in sysfs under /sys/firmware/devicetree/base, so /proc/device-tree is now a symbolic link to that location.
The ipset packet filtering interface has a new "hash:ip,mark" set type for matching packets with specific marks added by higher-level filtering tools.
The just-in-time compiler for BPF-based packet filtering code has been extensively reworked, with a different instruction set. See the "kernel internals" section of Documentation/networking/filter.txt for details.
The function tracer can now be used within multi-buffer trace instances, allowing each instance to trace a different set of function calls.
User-space probing with uprobes is now supported on the ARM architecture.
The per-thread VMA caching patch set has been merged; it should improve memory management performance for a number of workloads.
The zram compressed in-memory swap mechanism can now optionally use LZ4 compression.
The Tile architecture now supports the perf events subsystem.
Support for the ancient Unisys ES7000, IBM Summit/EXA, SGI Visual Workstation, and NUMAQ x86 subarchitectures has been removed, as has support for PowerPC-based Motorola PrPMC2800 boards.
New hardware support includes:
- Processors and systems: MIPS systems using the Coherent Processing System architecture, Loongson 3 processors, Marvell Armada 375, 380, and 385 systems, and Broadcom BCM470X and BCM5301X systems. Note that support for numerous ARM-based boards has also been added, but that support consists entirely of device tree changes. Plans to move the device tree data out of the kernel tree still exist, but keep getting pushed back.
- Audio: Texas Instruments tlv320aic31xx codecs, TI PCM512x codecs, Analog Devices ADAU1977, ADAU1978 and ADAU1979 audio codecs, Cirrus Logic CS42448/CS42888 codecs, Intel "Smart Sound Technology" devices, Intel Haswell Lynxpoint DSPs, Intel Baytrail/RT5640 codecs, and SiRF internal codecs.
- Block: Allwinner A10/A20 AHCI SATA controllers, APM X-Gene AHCI SATA controllers, and DaVinci DA850 AHCI SATA controllers.
- Graphics: NXP PTN3460 DisplayPort-to-LVDS bridges, Samsung EXYNOS DRM MIPI-DSI devices, LD9040 RGB/SPI panels, and S6E8AA0 DSI video mode panels.
- Hardware monitoring: Linear Technology LTC2945 I2C system monitors, Linear Technology LTC4261 positive voltage hot swap controller I2C interfaces, Linear Technology LTC4222 dual hot swap controller I2C interfaces, and Texas Instruments ADC128D818 system monitors.
- Input: Cirrus Logic CLPS711X matrix keypads and GPIO buttons on Intel Bay Trail-based tablets.
- Miscellaneous: NVIDIA Tegra watchdog timers, devices connected via the MEN Chameleon Bus including MEN 16z188 analog-to-digital converters, TI Asynchronous External Memory Interface controllers, Silabs Si7005 relative humidity and temperature sensors, Lite-On LTR-501ALS-01 ambient light and proximity sensors, Freescale vf610 analog-to-digital converters, Xilinx analog-to-digital converters, Keithley Metrabyte DAC02 compatible ISA cards, Silicon Labs CP2112 HID USB-to-SMBus bridges, Dallas/Maxim DS1347 realtime clocks, LSI ZEVIO SoC memory mapped GPIO controllers, Synopsys DesignWare APB GPIO controllers, ARM Cirrus Logic CLPS711X SYSFLG1 MCTRL GPIO controllers, Freescale FlexTimer Module PWM controllers, Cirrus Logic CLPS711X PWM controllers, Intel LPSS PWM controllers, and Realtek USB 2.0 card readers.
- Networking: Bluetooth HCI controllers with Nokia H4 extensions, Broadcom 7xxx SOCs internal PHYs, Broadcom GENET internal MACs, Realtek RTL8723BE PCIe wireless network adapters, Texas Instruments TRF7970a NFC controllers, Redpine Signals 91x WLAN adapters, Altera Triple-Speed Ethernet MACs, Samsung SXGBE 10G Ethernet controllers, and Altera SOCFPGA Ethernet controllers.
- Power: Broadcom BCM590xx PMU regulators, Samsung S2MPA01 and S2MPS14 voltage regulators, TI TPS65218 power management chips, Broadcom BCM590xx power management units, and devices connected via the System Power Management Interface spec, including Qualcomm MSM SPMI controllers.
- SPI: Qualcomm QUP SPI controllers, Allwinner A10 and A31 SPI controllers, and Xtensa xtfpga SPI controllers.
- USB: Exynos5250 SATA SerDes PHYs, Samsung USB 2.0 PHYs, Allwinner sunxi SoC USB PHYs, and Realtek RTL8723AU USB wireless network adapters.
- Video4Linux: Micronas DRX-J demodulators, TI LM3646 dual flash devices, ImgTec infrared decoders, Mirics MSi001 silicon tuners, Realtek RTL2832 silicon tuners, Samsung S5K6A3 sensors, and EXYNOS4x12 FIMC-IS ISP direct DMA capture interfaces.

Changes visible to kernel developers include:

Rather later than anybody might have expected, the sleep_on() family of functions has been removed from the kernel.
There is a new "locktorture" module which performs various types of stress testing on kernel locking primitives.
The kernel address-space layout randomization code has been extended to randomize the base address for loadable modules. A single random offset is chosen once by the kernel and used with each module as it is loaded.
The arm64 architecture now has support for the KGDB kernel debugger.
Basic CPU topology support has been added to the arm64 architecture, allowing the kernel to represent the system's architecture as described by the firmware.
The PREPARE_WORK() and PREPARE_DELAYED_WORK() workqueue macros have been removed. The interface was prone to subtle errors and was never used widely within the kernel.
The timer broadcast patches have been merged, allowing the delivery of timer events to a sleeping CPU even if that CPU's timers stop while it is in the sleep state.
The rewriting of the core control group code continues. Changes merged this time around include a full transition to the new "kernfs" virtual filesystem for control files, some steps toward the unified hierarchy model, and the removal of the ability to build controllers as modules.
There is a new method (map_pages()) in struct vm_operations_struct; its job is to perform opportunistic "fault around" mapping of pages, hopefully reducing page faults and improving performance. Note that map_pages() is not allowed to block. The page cache uses this function to map surrounding pages on page faults. A new debugfs knob (fault_around_order) enables playing with and tuning this functionality.
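
The "fault around" idea in the last item can be illustrated with a little address arithmetic. What follows is a minimal sketch, not the kernel's implementation: it assumes 4 KiB pages, assumes that the window is the naturally aligned block of 2^order pages containing the faulting address, and clips that block to the bounds of the mapping. The names fault_around_window() and struct fa_window are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* assumption: 4 KiB pages */

/* Half-open byte range [start, end) that a fault-around pass would cover. */
struct fa_window {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical helper, not the kernel's code: pick the naturally aligned
 * block of 2^order pages containing fault_addr, then clip it to the
 * mapping [map_start, map_end).
 */
static struct fa_window fault_around_window(uint64_t fault_addr, unsigned order,
                                            uint64_t map_start, uint64_t map_end)
{
    assert(map_start <= fault_addr && fault_addr < map_end);
    assert(order < 64 - PAGE_SHIFT);

    uint64_t span = (uint64_t)1 << (PAGE_SHIFT + order);
    struct fa_window w;

    w.start = fault_addr & ~(span - 1);
    w.end = w.start + span;

    /* Never map pages outside the mapping (also covers address wraparound). */
    if (w.start < map_start)
        w.start = map_start;
    if (w.end < w.start || w.end > map_end)
        w.end = map_end;
    return w;
}
```

Under these assumptions, an order of 4 gives a 16-page window, so a fault at 0x12345 in a large mapping would cover 0x10000 up to (but not including) 0x20000; an order of 0 reduces the window to the single faulting page.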

The merge window can be expected to remain open until around April 13. At this point, though, most of the major trees have been pulled, so there probably will not be a lot of changes showing up in the last few days. Perhaps the biggest remaining question mark is the support for link-time optimization (LTO). This toolchain feature has the potential to improve kernel performance while reducing its total size; this happens at the cost of an increased build time. Linus is unconvinced by the merits of this patch set and is asking for more information. A number of other developers have asked for its inclusion, but it is not yet clear whether that will be enough to turn the tide.

Tune in next week for a summary of the final changes merged for this development cycle.

Lots of new perf features

By Jake Edge
April 9, 2014

New features for the perf tracing tool and its infrastructure were the topic of two talks squeezed into one slot at the 2014 Linux Foundation Collaboration Summit. Red Hat's Jiri Olsa and LG Electronics's Namhyung Kim looked at both recent changes that have gone into perf and those that are still pending. Each covered a different set of improvements to perf, including those they created themselves and those that came from other developers.

Olsa started things off by describing the libtraceevent plugins that will help in parsing the tracepoint format lines. libtraceevent itself was written by Steven Rostedt as part of the trace-cmd front-end for ftrace, but is now used by perf as well. Each kernel tracepoint has formatting information in its format file ("print fmt") that describes the format used by the kernel to output the data from the tracepoint, but that line is difficult for user-space code to use, he said. Plugins will allow user space to easily parse that information to produce the same output as the kernel.

Intel processor feature support

Support for the "running average power limit" (RAPL) feature in Intel processors was next up. It allows administrators to set and monitor power consumption limits for various hardware domains in the system. With this addition, authored by Stephane Eranian, perf can sample and report on power consumption in several different categories: all physical cores, the processor package, the DRAM, or the built-in GPU.

Another Intel-only feature allows perf to do memory profiling on those systems. Recent Intel CPUs provide a way to see the individual loads and stores of memory. Each of those events has additional information associated with it, including the instruction address, data address, location of access (L1 cache, local RAM, or remote cache, for example), and type of translation lookaside buffer (TLB) access (hit, miss, L1, ...). There is also information on the "weight" of the access, which is the amount of work in cycles that the processor spent on the memory access. The weight values are not particularly reliable, Olsa said, but are getting better with each new processor.

Perf itself only records the memory access information; it does not try to analyze it. Another tool, c2c (for "cache to cache"), tries to detect cache-line sharing between processors, which is a good thing to avoid, he said. It will report on each cache line, giving the address of the line, the type of access (load/store), the offset of the access, and the instruction that caused the access. With that information, developers should be able to spot cache-line bouncing in their systems.

Better backtrace

Improvements to the backtrace output from ftrace were next on Olsa's list. The idea is to be able to see the call chain that led up to a particular sample. That is currently done using frame pointers. If frame pointers are not available, the newly added libdw can use the DWARF debugging data format to unwind the stack and produce a backtrace. Perf can record the user stack and registers, then use libdw to provide a backtrace at report time, which is faster than using libunwind as was done previously.

An even faster mechanism to produce backtraces uses the "last branch record" (LBR) information. Enabling LBR will store a list of the last branches taken by the code each time a sample is taken. Both the "from" and "to" addresses are stored, and that information can be used to produce a backtrace.

Future changes

All of the preceding changes have gone into perf relatively recently; Olsa then turned to some features that can be expected in the future. There is a plan to support the common trace format (CTF) in perf; the other user of CTF currently is the Linux trace toolkit next generation (LTTng). Making "perf record" multi-threaded is also on the drawing board. That will require per-CPU storage for perf data. Supporting multiple output data files (with a maximum size per file) is planned as well.

Some bigger features that are coming include a mechanism to allow events to toggle other events on or off to help narrow down the measured area. The original code was from Frédéric Weisbecker, but it is missing a suitable user interface, Olsa said. Currently it has an "ugly" command-line interface.

Currently, running "perf record" requires an open file for each CPU for each event. For traces with lots of events on systems with lots of CPUs, the maximum open file descriptor limit will be hit. So there are plans to allow for "event groups" that would share a single file descriptor. This idea has only been discussed, Olsa said; no code has been posted.
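
The arithmetic behind that limit is simple enough to sketch. The helpers below are hypothetical and count only the one-descriptor-per-event-per-CPU cost described in the talk; they ignore whatever other files a tracing session keeps open, and they say nothing about how perf itself checks its limits.

```c
#include <assert.h>
#include <limits.h>
#include <sys/resource.h>

/* One descriptor per event per CPU, as described in the talk. */
static unsigned long fds_needed(unsigned long cpus, unsigned long events)
{
    assert(events == 0 || cpus <= ULONG_MAX / events); /* no overflow */
    return cpus * events;
}

/* Returns 1 if the session would not fit under the given descriptor limit. */
static int exceeds_fd_limit(unsigned long cpus, unsigned long events,
                            unsigned long limit)
{
    return fds_needed(cpus, events) > limit;
}

/* Soft RLIMIT_NOFILE of the calling process, or 0 if it cannot be read. */
static unsigned long current_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    if (rl.rlim_cur == RLIM_INFINITY)
        return ULONG_MAX;
    return (unsigned long)rl.rlim_cur;
}
```

As an illustration with made-up numbers, 32 events on a 64-CPU machine would need 2048 descriptors, which is twice a soft limit of 1024; a caller could compare fds_needed() against current_fd_limit() before starting.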

Generic code will be moved out of perf itself and into the kernel tools/lib directory so that other tools have access to it. In closing, Olsa mentioned that there are 26 or so automated tests for perf that get run each time a commit is made. That test suite is getting bigger over time.

Output changes

With that, Kim took over to discuss even more perf enhancements. A change to "perf report" will allow users to see the caller information more clearly, he said. The --children option will show the full call chain and report each member of the call chain's impact on the measured quantity. It adds "Children" and "Self" columns to the perf output showing the cumulative contribution at each step of the call chain.
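
To make the two columns concrete, here is a simplified model of the accounting, not perf's actual code. It assumes that "Self" counts the samples in which a function is the leaf of the call chain, and that "Children" counts the samples in which the function appears anywhere in the chain (once per sample, even if it recurses). The types and the accumulate() function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8

struct sample {
    size_t chain[MAX_DEPTH]; /* function ids, outermost caller first */
    size_t depth;            /* number of valid entries; the last is the leaf */
};

struct totals {
    unsigned self;     /* samples taken while this function was running */
    unsigned children; /* samples with this function anywhere in the chain */
};

/* Fill out[0..nfuncs) with per-function counts for the given samples. */
static void accumulate(const struct sample *samples, size_t nsamples,
                       struct totals *out, size_t nfuncs)
{
    for (size_t f = 0; f < nfuncs; f++) {
        out[f].self = 0;
        out[f].children = 0;
    }

    for (size_t i = 0; i < nsamples; i++) {
        const struct sample *s = &samples[i];
        assert(s->depth >= 1 && s->depth <= MAX_DEPTH);

        for (size_t d = 0; d < s->depth; d++) {
            size_t fn = s->chain[d];
            int seen = 0;

            assert(fn < nfuncs);
            /* Count a recursive function only once per sample. */
            for (size_t e = 0; e < d; e++)
                if (s->chain[e] == fn)
                    seen = 1;
            if (!seen)
                out[fn].children++;
        }
        out[s->chain[s->depth - 1]].self++;
    }
}
```

In this model, a top-level function that appears in every chain gets a "Children" count equal to the total number of samples, while its "Self" count reflects only the samples taken in its own body; dividing by the sample count would give percentages.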

Another output change is the --fields option, which specifies a comma-separated list of the columns that will be in the output. It works with the existing --sort option, allowing users to specify which columns to sort and which to output.

Ftrace support

Integrating ftrace support into perf is another new feature. The idea is to integrate the tool side of ftrace into perf, Kim said. Currently, only the function and function_graph tracers are supported. The feature uses the libtraceevent kbuffer API to access the events. Providing perf-like behavior using ftrace events is the goal.

To use it, one runs "perf ftrace record". After the recording, one can get ftrace-like output using "perf ftrace show" or perf-like output using "perf ftrace report". It essentially allows the user to access the ftrace ring buffer and its events from perf. More tracers will be added over time.

DWARF support has been added to uprobes, which allows perf to place dynamic tracepoints using symbolic names and line numbers. That work was done by Masami Hiramatsu and is already upstream, Kim said. Using the --line and --vars options will use the debug information in a binary to create uprobe tracepoints that perf can sample for those locations.

Statically defined tracepoints

The final new feature that Kim covered was statically defined tracepoints (SDTs), which are similar to the kernel's tracepoints, but specified for user-space applications. SDT is the same format used by DTrace and multiple applications have already added tracepoints in that format. With the addition of SDT support, perf can access and sample these SDTs. There is support for listing which SDTs are available in a program and support for processing them using uprobes. Support for SDT arguments and for treating SDTs like normal perf events is coming, Kim said.

As can be seen, there is a lot of activity going on in the perf world. Much of it consists of straightforward improvements and support for new processor features, but the ftrace integration and event toggling will be fairly substantial new pieces. Making it all work with user-space programs and DTrace tracepoints is also a nice benefit. For Linux tracing, most of the facilities are now in place, but the interfaces are still geared toward advanced developers and kernel hackers; simpler interfaces would seem to be an area that needs more attention in the future.

[ Thanks to the Linux Foundation for supporting my travel to the Collaboration Summit. ]

Sealed files

By Jonathan Corbet
April 9, 2014

Interprocess communication using shared memory can be easy and efficient, but there is a catch: all of the processes involved must be able to trust each other. They must be able to assume that their peers will not modify memory contents after making them available; otherwise, no end of mischief is possible in the time between when memory contents are checked for sanity and when they are actually used. Similarly, processes need to trust that their peers will not truncate the file backing up a shared memory region at an inopportune time, causing fatal signals when they try to access parts of the region that no longer exist. In the real world, where this kind of trust is often not present, careful users of shared memory must copy data before using it and be prepared for signals; that kind of programming is cumbersome, error-prone, and slow.

Developers have been talking about coming up with a solution to these problems for a bit; this discussion took the form of real code in mid-March, when David Herrmann posted his file sealing patch set. The sealing concept allows one party in a shared-memory transaction to work with a memory segment in the knowledge that it cannot be changed by another process at an inopportune time.

A process working with the sealing API will start by creating a file on an shmfs filesystem and mapping it into memory. That memory region is then filled with whatever data the process wants to pass to another process on the system. When the segment is ready to be handed over, the process can seal it with a call to fcntl() using the new SHMEM_SET_SEALS command. There are three types of seal that can be set on a file:

SEAL_SHRINK prevents the file from being reduced in size.
SEAL_GROW disallows file growth.
SEAL_WRITE prevents all modifications except resizing.

If all three seals are set, then the file becomes immutable. Seals cannot be set on a file that has writable mappings; the creating process must remove all such mappings with munmap() before the fcntl() call.

Once the file is sealed, the associated file descriptor can be passed to the peer process, which can verify that the seals are in place using the SHMEM_GET_SEALS fcntl() operation. If the seals are there, the recipient knows that the file (and associated shared memory region) cannot be changed in the indicated ways. That makes the use of zero-copy techniques much safer, and avoids a number of other potential issues as well.

The actual enforcement of the seals is done within the shmfs filesystem. It is not hard to augment the calls implementing write() and truncate() to check for the existence of a seal and fail with EPERM should a seal exist. Since (as mentioned above) no writable mappings can exist when a seal is applied, all that is needed to prevent modification through memory mappings is a check in the mmap() implementation when a writable mapping is requested. It appears that the kernel can indeed credibly promise that a sealed file will not be changed in the indicated ways.
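
The checks described above can be modeled in a few lines of user-space code. This is a toy model under stated assumptions, not the patch set's implementation: the bit values given to the seals, the model_* function names, and the use of EBUSY when writable mappings exist are all hypothetical, and the model assumes that a write extending past the end of the file counts as growth. Only the EPERM result for sealed operations comes from the description above. The rule about exclusive references is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bit values; the article does not give the real ones. */
#define SEAL_SHRINK 0x1u
#define SEAL_GROW   0x2u
#define SEAL_WRITE  0x4u

struct model_file {
    size_t size;
    unsigned seals;
    unsigned writable_mappings;
};

/* Seals may only be set while no writable mapping exists. */
static int model_set_seals(struct model_file *f, unsigned seals)
{
    assert(f != NULL);
    if (f->writable_mappings > 0)
        return -EBUSY; /* assumption: the article names no error code */
    f->seals |= seals;
    return 0;
}

/* A writable mapping is refused once SEAL_WRITE is in place. */
static int model_mmap_writable(struct model_file *f)
{
    if (f->seals & SEAL_WRITE)
        return -EPERM;
    f->writable_mappings++;
    return 0;
}

static void model_munmap_writable(struct model_file *f)
{
    assert(f->writable_mappings > 0);
    f->writable_mappings--;
}

/* truncate(): blocked by SEAL_SHRINK or SEAL_GROW depending on direction. */
static int model_truncate(struct model_file *f, size_t new_size)
{
    if (new_size < f->size && (f->seals & SEAL_SHRINK))
        return -EPERM;
    if (new_size > f->size && (f->seals & SEAL_GROW))
        return -EPERM;
    f->size = new_size;
    return 0;
}

/* write(): blocked by SEAL_WRITE; growth past EOF is blocked by SEAL_GROW. */
static int model_write(struct model_file *f, size_t offset, size_t len)
{
    size_t end = offset + len;

    if (f->seals & SEAL_WRITE)
        return -EPERM;
    if (end > f->size) {
        if (f->seals & SEAL_GROW)
            return -EPERM;
        f->size = end;
    }
    return 0;
}
```

The model makes the ordering constraint visible: sealing fails while a writable mapping is outstanding, and once SEAL_WRITE is set, neither write() nor a new writable mapping can succeed, so no path to modifying the contents is left in this simplified picture.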

One might argue that the potential for shared-memory mischief has just been replaced with the potential for seal-related attacks instead. The feature has been developed with an eye toward preventing such attacks, though. To start with, only shmfs supports sealing, so there should be no issues with hostile processes setting seals on real files. Once the initial seals have been set, they can only be changed by a process that has an exclusive reference to the file. A recipient process can thus verify that the seals are in place knowing that they cannot be removed as long as it holds its own reference to the file. As a result, it should not be possible to perform denial-of-service attacks by placing seals on random files, and seals cannot be changed while another process is counting on their protection.

For those who do not want to mount shmfs and work with files explicitly, there is also a new system call:

    int memfd_create(const char *name, u64 size, u64 flags);

This call will create a file (not visible in the filesystem) that is suitable for sealing; it will be associated with the given name and be size bytes in length. The return value on success will be a file descriptor associated with the newly created file. The only recognized flag is MFD_CLOEXEC, which maps to O_CLOEXEC internally, causing the file descriptor to be automatically closed if the process calls one of the forms of exec(). The returned file descriptor can be passed to mmap(), of course.

Most commenters seemed happy enough with the proposed functionality, but there were a number of questions about the implementation and the semantics. Linus didn't like the rules regarding when seals could be changed; he suggested that, instead, only the creator of a file should be allowed to seal it. David is not averse to doing things that way; if sealing were made into a one-time operation, the reference counting on files could be eliminated entirely. He might also add a new flag (MFD_ALLOW_SEALING) to memfd_create() and restrict the sealing operations to files created with that flag set.

Ted Ts'o, instead, suggested that sealing should not be limited to shmfs files. Instead, he would like to see consideration given to implementing this functionality in the virtual filesystem layer so that sealing could be used with files from any filesystem. David responded that he didn't see a use case for sealing in any other context, but Ted would still like to see a more general use for this functionality. This part of the conversation wound down without any resolution.

There are a number of fairly immediate use cases for the sealing functionality in general. Graphics drivers could use it to safely receive buffers from applications. The upcoming kdbus transport can benefit from sealing. The Android "ashmem" allocator also implements a similar feature that could be moved over once this code gets upstream. So the chances of this functionality being merged into the mainline are fairly good, even though the details of how things will work have not yet been sealed in place.

Patches and updates

Greg KH: Linux 3.13.9
Jiri Slaby: Linux 3.12.17
Luis Henriques: Linux 3.11.10.7
Greg KH: Linux 3.10.36
Kamal Mostafa: Linux 3.8.13.21
Luis Henriques: Linux 3.5.7.33
Greg KH: Linux 3.4.86
Alex Elder: ARM: SMP: support Broadcom mobile SoCs
Haojian Zhuang: support Hisilicon HiP04
Thomas Petazzoni: SMP support for Armada 375 and 38x
Leif Lindholm: arm64: UEFI support
Jean Pihet: perf, persistent: Add persistent events
Tejun Heo: cgroup: implement cgroup.populated
Steven Rostedt: rwsem: The return of multi-reader PI rwsems
Dave Chinner: xfstests: updated to cf1ed54
Christopher Li: Sparse 0.5.0
Tim Kryger: Add Broadcom Kona PWM Support
Struan Bartlett: netconsole: Add tty driver
Stanimir Varbanov: Add Qualcomm crypto driver
Linus Walleij: MFD: driver for Atmel Microcontroller on iPaq h3xxx
Pantelis Antoniou: Introducing (yet again) Device Tree Overlays
Zhangfei Gao: add hisilicon hip04 ethernet driver
Matt Roper: Universal plane preparation patches
Stephen Boyd: Support qcom GDSC hardware
Stephen Boyd: Krait L1/L2 EDAC driver
Jason Baron: Add new ie31200_edac driver
Vivek Gautam: Add Exynos5 USB 3.0 phy driver based on generic PHY framework
Tomasz Stanislawski: phy: Add exynos-simple-phy driver
Santiago Leon: New driver for IBM System i/p VNIC protocol
David Cohen: Initial implementation of Intel MID watchdog driver
Michal Malý: Introduce ff-memless-next as an improved replacement for ff-memless
Srinivas Pandruvada: devres: introduce API "devm_kmemdup"
Shuah Khan: managed token devres interfaces
Michael Kerrisk (man-pages): For review: open_by_handle_at(2) man page [v4]
Michael Kerrisk (man-pages): man-pages-3.64 is released
Christoffer Dall: ARM VM System Specification
Apelete Seketeli: Add documentation on writing an musb glue layer
Peter Zijlstra: sched_{set,get}attr() manpage
Miklos Szeredi: renameat2 syscall
Lukas Czerner: fsx: Add fallocate collapse range operation
Zheng Liu: vfs: add closefrom(2) syscall
Liu Bo: Online(inband) data deduplication
Luiz Capitulino: hugetlb: add support gigantic page allocation at runtime
Yu Zhao: Per-cgroup swap file support
Minchan Kim: support madvise(MADV_FREE)
Mel Gorman: Use an alternative to _PAGE_PROTNONE for _PAGE_NUMA
Mel Gorman: Disable zone_reclaim_mode by default
Pablo Neira Ayuso: new transaction infrastructure for nf_tables (v3)
Douglas Gilbert: sg3_utils-1.38 available
Lucas De Marchi: kmod 17
