Kernel development
Brief items
Kernel release status
The 3.14 merge window remains open, so there is no current development kernel release.

Stable updates: 3.12.9 and 3.10.28 were released on January 25, followed by 3.13.1 and 3.4.78 on January 29.
Quotes of the week
#define PERM__rw_r__r__ 0644
#define PERM__r________ 0400
#define PERM__r__r__r__ 0444
#define PERM__r_xr_xr_x 0555
Gorman: LSF/MM 2014 so far
Mel Gorman, chair of the 2014 Linux Storage, Filesystem, and Memory Management Summit, notes that the CFP deadline is approaching and that the event is shaping up nicely. "I am pleased to note that there are a number of new people sending in attend and topic mails. The long-term health of the community depends on new people getting involved and breaking through any perceived barrier to entry. At least, it has been the case for some time that there is more work to do in the memory manager than there are people available to do it. It helps to know that there are new people on the way." Anybody wanting to attend who has not yet sent in a proposal should not delay much further.
Kernel development news
3.14 Merge window part 2
As of this writing, almost 8,600 non-merge changesets have been pulled into the mainline repository for the 3.14 development cycle — 5,300 since last week's merge window summary. As can be seen from the list below, quite a bit of new functionality has been added to the kernel in the last week. Some of the more significant, user-visible changes merged include:
- The event triggers feature has been
added to the tracing subsystem. See this
commit for some information on how to use this feature.
- The user-space probes (uprobes) subsystem has gained support for a
number of "fetch methods" providing access to data on the stack, from
process memory, and more. See the
patchset posting for more information.
- The Xen paravirtualization subsystem has gained support for a "paravirtualization
in an HVM container" (PVH) mode which makes better use of hardware
virtualization extensions to speed various operations (page table
updates, for example).
- The ARM architecture can be configured to protect kernel module text
and read-only data from modification or execution. The help text for
this feature notes that it may interfere with dynamic tracing.
- The new SIOCGHWTSTAMP network ioctl() command allows an application to retrieve the current hardware timestamping configuration without changing it; a brief usage sketch appears after the hardware support list below.
- "TCP autocorking" is a new networking feature that will delay small
data transmissions in the hope of coalescing them into larger
packets. The result can be better CPU and network utilization. The
new tcp_autocorking sysctl knob can be used to turn off this
feature, which is on by default.
- The Bluetooth Low Energy support now handles connection-oriented
channels, increasing the number of protocols that can work over the LE
mode. 6LoWPAN emulation
support is also now available for Bluetooth LE devices.
- The Berkeley Packet Filter subsystem has acquired a couple of new
user-space tools: a debugger and a simple assembler. See the newly
updated Documentation/networking/filter.txt for
more information.
- The new "heavy-hitter filter" queuing discipline tries to distinguish
small network flows from the big ones, prioritizing the former. This
commit has
some details.
- The "Proportional Integral controller Enhanced" (PIE) packet scheduler
is aimed at eliminating bufferbloat problems. See this
commit for more information.
- The xtensa architecture code has gained support for multiprocessor systems.
- The Ceph distributed filesystem now has support for access control
lists.
- New hardware support includes:
- Processors and systems:
Marvell Berlin systems-on-chip (SoCs),
Energy Micro EFM32 SoCs,
MOXA ART SoCs,
Freescale i.MX50 processors,
Hisilicon Hi36xx/Hi37xx processors,
Snapdragon 800 MSM8974 SoCs,
Systems based on the ARM "Trusted Foundations" secure monitor,
Freescale TWR-P102x PowerPC boards, and
Motorola/Emerson MVME5100 single board computers.
- Clocks:
Allwinner sun4i/sun7i realtime clocks (RTCs),
Intersil ISL12057 RTCs,
Silicon Labs 5351A/B/C programmable clock generators,
Qualcomm MSM8660, MSM8960, and MSM8974 clock controllers, and
Haoyu Microelectronics HYM8563 RTCs.
- Miscellaneous:
AMD cryptographic coprocessors,
Freescale MXS DCP cryptographic coprocessors (replacement for an
older, unmaintained driver),
OpenCores VGA/LCD core 2.0 framebuffers,
generic GPIO-connected beepers,
Cisco Virtual Interface InfiniBand cards,
Active-Semi act8865 voltage regulators,
Maxim 14577 voltage regulators,
Broadcom BCM63XX HS SPI controllers, and
Atmel pulse width modulation controllers.
- Multimedia Card (MMC):
Arasan SDHCI controllers and
Synopsys DesignWare interfaces on Hisilicon K3 SoCs.
- Networking: Marvell 8897 WiFi and near-field communications (NFC) interfaces, Intel XL710 X710 Virtual Function Ethernet controllers, and Realtek RTL8153 Ethernet adapters.
Note also that the AIC7xxx SCSI driver, deprecated since the 2.4 days, has finally been removed from the kernel.
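As promised above, here is a minimal sketch of how an application might use the new SIOCGHWTSTAMP command to read back the hardware timestamping configuration without disturbing it. The interface name "eth0" is an assumption, error handling is kept to a minimum, and kernel headers new enough to define the ioctl are required:

    /* Minimal sketch: query the hardware timestamping configuration of
     * "eth0" without changing it.  Illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/net_tstamp.h>    /* struct hwtstamp_config */
    #include <linux/sockios.h>       /* SIOCGHWTSTAMP (3.14 headers) */

    int main(void)
    {
        struct hwtstamp_config cfg;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return 1;

        memset(&cfg, 0, sizeof(cfg));
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&cfg;

        /* Unlike SIOCSHWTSTAMP, this only reads the current state. */
        if (ioctl(fd, SIOCGHWTSTAMP, &ifr) < 0) {
            perror("SIOCGHWTSTAMP");
            close(fd);
            return 1;
        }

        printf("tx_type: %d, rx_filter: %d\n", cfg.tx_type, cfg.rx_filter);
        close(fd);
        return 0;
    }

Before this addition, the only way to learn the current settings was to set new ones with SIOCSHWTSTAMP, which requires privileges and perturbs the configuration.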
Changes visible to kernel developers include:
- The ARM architecture code can be configured to create a file
(kernel_page_tables) in the debugfs filesystem where the
layout of the kernel's page tables can be examined.
- The checkpatch script will now complain about memory allocations using
the __GFP_NOFAIL flag.
- There is a new low-level library for computing hash values in situations where speed is more important than the quality of the hash; see this commit for details.
At this point, the 3.14 merge window appears to be winding down. If the usual two-week standard applies, the window should stay open through February 2, but Linus has made it clear in the past that the window can close earlier if he sees fit. Either way, next week's Kernel Page will include a summary of the final changes pulled for this development cycle.
Preparing for large-sector drives
Back in the distant past (2010), kernel developers were working on supporting drives with 4KB physical sectors in Linux. That work is long since done, and 4KB-sector drives work seamlessly. Now, though, the demands on the hard drive industry are pushing manufacturers toward the use of sectors larger than 4KB. A recent discussion ahead of the upcoming (late March) Linux Storage, Filesystem and Memory Management Summit suggests that getting Linux to work on such devices may be a rather larger challenge requiring fundamental kernel changes — unless it isn't.

Ric Wheeler started the discussion by proposing that large-sector drives could be a topic of discussion at the Summit. The initial question — when such drives might actually become reality — did not get a definitive answer; drive manufacturers, it seems, are not ready to go public with their plans. Clarity increased when Ted Ts'o passed along a bit of information that he was able to share on the topic.
Larger sectors would clearly bring some inconvenience to kernel developers, but, since they can help drive manufacturers offer more capacity at lower cost, they seem almost certain to show up at some point.
Do (almost) nothing
One possible response, espoused by James Bottomley, is to do very little in anticipation of these drives. He pointed out that much of the work done to support 4KB-sector drives was not strictly necessary; the drive manufacturers said that 512-byte transfers would not work on such drives, but the reality has turned out to be different. Not all operating systems were able to adapt to the 4KB size, so drives have read-modify-write (RMW) logic built into their firmware to handle smaller transfers properly. So Linux would have worked anyway, albeit with some performance impact.
James's point is that the same story is likely to play out with larger sector sizes; even if manufacturers swear that only full-sector transfers will be supported, those drives will still, in the end, have to work with popular operating systems. To do that, they will have to support smaller transfers with RMW. So it comes down to what's needed to perform adequately on those drives. Large transfers will naturally include a number of full-sector chunks, so they will mostly work already; the only partial-sector transfers would be the pieces at either end. Some minor tweaks to align those transfers to the hardware sector boundary would improve the situation, and a bit of higher-level logic could cause most transfers to be sized to match the underlying sector size. So, James argued, the kernel is already most of the way there; in effect, a 99% solution.
But Martin Petersen, arguably the developer most on top of what manufacturers are actually doing with their drives, claimed that, while consumer-level drives all support small-sector emulation with RMW, enterprise-grade drives often do not. If the same holds true for larger-sector drives, the 99% solution may not be good enough and more will need to be done.
Larger blocks in the kernel
There are many ways in which large sector support could be implemented in the kernel. One possibility, mentioned by Chris Mason, would be to create a mapping layer in the device mapper that would hide the larger sector sizes from the rest of the kernel. This option just moves the RMW work into a low-level kernel layer, though, and does nothing to address the performance issues associated with that extra work.
Avoiding the RMW overhead requires that filesystems know about the larger sector size and use a block size that matches. Most filesystems are nearly ready to do that now; they are generally written with the idea that one filesystem's block size may differ from another. The challenges are, thus, not really at the filesystem level; where things get interesting is with the memory management (MM) subsystem.
The MM code deals with memory in units of pages. On most (but not all) architectures supported by Linux, a page is 4KB of memory. The MM code charged with managing the page cache (which occupies a substantial portion of a system's RAM) assumes that individual pages can easily be moved to and from the filesystems that provide their backing store. So a page fault may just bring in a single 4KB page, without regard for the fact that said page may be embedded within a larger sector on the storage device. If the 4KB page cannot be read independently, the filesystem code must read the whole sector, then copy the desired page into its destination in the page cache. Similarly, the MM code will write pages back to persistent store with no understanding of the other pages that may share the same hardware sector; that could force the filesystem code to reassemble sectors and create surprising results by writing out pages that were not, yet, meant to be written.
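To make that read-modify-write cost concrete, here is a rough userspace-flavored sketch. It is not taken from any filesystem; the 64KB sector size is purely an assumption, and dev_fd stands in for a descriptor open on the device. It shows why writing a single 4KB page drags the whole surrounding sector along with it:

    /* Sketch of the read-modify-write penalty: writing one 4KB page that
     * sits inside a (hypothetical) 64KB hardware sector means reading and
     * rewriting the entire sector. */
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_BYTES   4096
    #define SECTOR_BYTES 65536    /* assumed large-sector size */

    int write_page_rmw(int dev_fd, off_t page_index, const void *page_data)
    {
        unsigned char sector_buf[SECTOR_BYTES];
        off_t byte_off     = page_index * PAGE_BYTES;
        off_t sector_start = byte_off - (byte_off % SECTOR_BYTES);

        /* Read: fetch the whole hardware sector containing the page. */
        if (pread(dev_fd, sector_buf, SECTOR_BYTES, sector_start) != SECTOR_BYTES)
            return -1;

        /* Modify: replace only the 4KB page actually being written. */
        memcpy(sector_buf + (byte_off - sector_start), page_data, PAGE_BYTES);

        /* Write: the full sector goes back out, dragging its neighboring
         * pages along whether or not they were meant to be written yet. */
        if (pwrite(dev_fd, sector_buf, SECTOR_BYTES, sector_start) != SECTOR_BYTES)
            return -1;
        return 0;
    }

Whether this dance happens in drive firmware, in a filesystem, or in a device-mapper layer, the extra read and the full-sector write are the performance cost being discussed.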
Avoiding these problems almost certainly means teaching the MM code to manage pages in larger chunks. There have been some attempts to do so over the years; consider, for example, Christoph Lameter's large block patch set that was covered here back in 2007. This patch enabled variable-sized chunks in the page cache, with anything larger than the native page size being stored in compound pages. And that is where this patch ran into trouble.
Compound pages are created by grouping together a suitable number of physically contiguous pages. These "higher-order" pages have always been risky for any kernel subsystem to rely on; the normal operation of the system tends to fragment memory over time, making such pages hard to find. Any code that allocates higher-order pages must be prepared for those allocations to fail; reducing the reliability of the page cache in this way was not seen as desirable. So this patch set never was seriously considered for merging.
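That fragility shows in the shape such code has to take. The kernel-style sketch below is not from any posted patch; it simply illustrates the fallback that any page-cache user of compound pages would need, since a physically contiguous allocation of any order above zero can fail once memory becomes fragmented:

    /* Kernel-style sketch (not from the patch set): try to allocate a
     * compound page big enough for a large block, falling back to a
     * single 4KB page when fragmentation makes that impossible. */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    static struct page *get_block_pages(unsigned int order)
    {
        struct page *page;

        /* A higher-order, physically contiguous allocation can fail at
         * any time on a fragmented system... */
        page = alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NOWARN, order);
        if (page)
            return page;

        /* ...so every caller needs a fallback path; a single page is the
         * only allocation the kernel can (nearly) always satisfy. */
        return alloc_page(GFP_KERNEL);
    }

For the page cache, of course, "fall back to a single page" defeats the purpose: the whole point was to match the device's larger sector size.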
Nick Piggin's fsblock work, also started in 2007, had a different goal: the elimination of the "buffer head" structure. It also enabled the use of larger blocks when passing requests to filesystems, but at a significant cost: all filesystems would have had to be modified to use an entirely different API. Fsblock also needed higher-order pages, and the patch set was, in general, large and intimidating. So it didn't get very far, even before Nick disappeared from the development community.
One might argue that these approaches should be revisited now. The introduction of transparent huge pages, memory compaction, and more, along with larger memory sizes in general, has made higher-order allocations much more reliable than they once were. But, as Mel Gorman explained, relying on higher-order allocations for critical parts of the kernel is still problematic. If the system is entirely out of memory, it can push some pages out to disk or, if really desperate, start killing processes; that work is guaranteed to make a number of single pages available. But there is nothing the kernel can do to guarantee that it can free up a higher-order page. Any kernel functionality that depends on obtaining such pages could be put out of service indefinitely by the wrong workload.
Avoiding higher-order allocations
Most Linux users, if asked, would not place "page cache plagued by out-of-memory errors" near the top of their list of desired kernel features, even if it comes with support for large-sector drives. So it would seem that any scheme based on being able to allocate physically contiguous chunks of memory larger than the base allocation size used by the MM code is not going to get very far. The alternatives, though, are not without their difficulties.
One possibility would be to move to the use of virtually contiguous pages in the page cache. These large pages would still be composed of a multitude of 4KB pages, but those pages could be spread out in memory; page-table entries would then be used to make them look contiguous to the rest of the kernel. This approach has special challenges on 32-bit systems, where there is little address space available for this kind of mapping, but 64-bit systems would not have that problem. All systems, though, would have the problem that these virtual pages are still groups of small pages behind the mapping. So there would still be a fair amount of overhead involved in setting up the page tables, creating scatter/gather lists for I/O operations, and more. The consensus seems to be that the approach could be workable, but that the extra costs would reduce any performance benefits considerably.
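A minimal kernel-style sketch of that approach, using nothing beyond the existing vmap() interface and an assumed 64KB block built from sixteen 4KB pages, illustrates both the appeal and the overhead:

    /* Sketch: build a virtually contiguous 64KB block out of sixteen
     * independently allocated 4KB pages.  Sizes are assumptions made for
     * illustration, not anything settled in the discussion. */
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    #define NR_SUBPAGES 16    /* 16 x 4KB = 64KB */

    static void *alloc_virtual_block(struct page **pages)
    {
        void *addr;
        int i;

        /* The individual 4KB pages are allocated one at a time, so
         * memory fragmentation is not a problem... */
        for (i = 0; i < NR_SUBPAGES; i++) {
            pages[i] = alloc_page(GFP_KERNEL);
            if (!pages[i])
                goto fail;
        }

        /* ...but making them look contiguous costs page-table setup, and
         * I/O on the block still needs a 16-entry scatter/gather list. */
        addr = vmap(pages, NR_SUBPAGES, VM_MAP, PAGE_KERNEL);
        if (addr)
            return addr;

    fail:
        while (--i >= 0)
            __free_page(pages[i]);
        return NULL;
    }

The allocation itself is reliable; it is the mapping and the per-subpage I/O bookkeeping that eat into the expected performance gains.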
Another possibility is to increase the size of the base unit of memory allocation in the MM layer. In the early days, when a well-provisioned Linux system had 4MB of memory, the page size was 4KB. Now that memory sizes have grown by three orders of magnitude — or more — the page size is still 4KB. So Linux systems are managing far more pages than they used to, with a corresponding increase in overhead. Memory sizes continue to increase, so this overhead will increase too. And, as Ted pointed out in a different discussion late last year, persistent memory technologies on the horizon have the potential to expand memory sizes even more.
So there are good reasons to increase the base page size in Linux even in the absence of large-sector drives. As Mel put it, "It would get more than just the storage gains though. Some of the scalability problems that deal with massive amount of struct pages may magically go away if the base unit of allocation and management changes." There is only one tiny little problem with this solution: implementing it would be a huge and painful exercise. There have been attempts to implement "page clustering" in the kernel in the past, but none have gotten close to being ready to merge. Linus has also been somewhat hostile to the concept of increasing the base page size in the past, fearing the memory waste caused by internal fragmentation.
A number of unpleasant options
In the end, Mel described the available options in this way:
- major filesystem overhaul
- major vm overhaul
- use compound pages as they are today and hope it does not go completely to hell, reboot when it does
With that set of alternatives to choose from, it is not surprising that none have, thus far, developed an enthusiastic following. It seems likely that all of this could lead to a most interesting discussion at the Summit in March. Even if large-sector drives could be supported without taking any of the above options, chances are that, sooner or later, the "major VM overhaul" option is going to require serious consideration. It may mostly be a matter of when somebody feels the pain badly enough to be willing to try to push through a solution.
Supporting Intel MPX in Linux
Buffer overflows have long been a source of serious bugs and security problems at all levels of the software stack. Much work has been done over the years to eliminate unsafe library functions, add stack-integrity checking and more, but buffer overflow bugs still happen with great regularity. A recently posted kernel patch is one of the final steps toward the availability of a new tool that should help to make buffer overflow problems less common: Intel's upcoming "MPX" hardware feature.

MPX is, at its core, a hardware-assisted mechanism for performing bounds checking on pointer accesses. The hardware, following direction from software, maintains a table of pointers in use and the range of accessible memory (the "bounds") associated with each. Whenever a pointer is dereferenced, special instructions can be used to ensure that the program is accessing memory within the range specified for that particular pointer. These instructions are meant to be fast, allowing bounds checking to be performed on production systems with a minimal performance impact.
As one might expect, quite a bit of supporting software work is needed to make this feature work, since the hardware cannot, on its own, have any idea of what the proper bounds for any given pointer would be. The first step in this direction is to add support to the GCC compiler. Support for MPX in GCC is well advanced, and should be considered for merging into the repository trunk sometime in the near future.
When a file is compiled with the new -fmpx flag, GCC will generate code to make use of the MPX feature. That involves tracking every pointer created by the program and the associated bounds; any time that a new pointer is created, it must be inserted into the bounds table for checking. Tracking of bounds must follow casts and pointer arithmetic; there is also a mechanism for "narrowing" a set of bounds when a pointer to an object within another object (a specific structure field, say) is created. The function-call interface is changed so that when a pointer is passed to a function, the appropriate bounds are passed with it. Pointers returned from functions also carry bounds information.
With that infrastructure in place, it becomes possible to protect a program against out-of-bounds memory accesses. To that end, whenever a pointer is dereferenced, the appropriate instructions are generated to perform a bounds check first. See Documentation/x86/intel_mpx.txt, included with the kernel patch set (described below), for details on how code generation changes. In brief: the new bndcl and bndcu instructions check a pointer reference against the lower and upper limits, respectively. If those instructions succeed, the pointer is known to be within the allowed range.
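The kind of bug this catches is easy to show. The program below is ordinary C with nothing MPX-specific in its source; the assumption is that it is built with an MPX-enabled GCC using the -fmpx flag described above, linked against a bounds-instrumented C library (discussed next), and run on MPX-capable hardware. Under those conditions, the store one element past the end of the buffer would fail the upper-bound (bndcu) check and trap rather than silently corrupting the heap:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* With MPX instrumentation, the bounds of "buf" (its start and
         * start + 16 ints) are recorded when malloc() returns the pointer. */
        int *buf = malloc(16 * sizeof(int));
        if (!buf)
            return 1;

        /* Off-by-one: index 16 is one element past the end.  Without MPX
         * this quietly scribbles on the heap; with instrumentation, the
         * upper-bound check on the final store fails and the CPU traps. */
        for (int i = 0; i <= 16; i++)
            buf[i] = i;

        printf("%d\n", buf[7]);
        free(buf);
        return 0;
    }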
The next step is to prepare the C library for bounds checking. At a minimum, that means building the library with -fmpx, but there is more to it than that. Any library function that creates an object (malloc(), say) needs to return the proper bounds along with the pointer to the object itself. In the end, the C library will be the source for a large portion of the bounds information used within an application. The bulk of the work for the GNU C library (glibc) is evidently done and committed to the glibc git repository. Instrumentation of other libraries would also be desirable, of course, but the C library is the obvious starting point.
Then there is the matter of getting the necessary support code into the kernel; Qiaowei Ren has recently posted a patch set to do just that. Part of the patch set is necessarily management overhead: allowing applications to set up bounds tables, removing bounds tables when the memory they refer to is unmapped, and so on. But much of the work is oriented around the user-space interface to the MPX feature.
The first step is to add two new prctl() options: PR_MPX_INIT and PR_MPX_RELEASE. The first of those sets up MPX checking and turns on the feature, while the second cleans everything up. Applications can thus explicitly control pointer bounds checking, but that is not how the feature is expected to be used in most cases. Instead, the system runtime will probably turn on MPX as part of application startup, before the application itself begins to run. Current discussion on the linux-kernel list suggests that it may be possible to do the entire setup and teardown job within the user-space runtime code, making these prctl() calls unnecessary, so they may not actually find their way into the mainline kernel.
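For completeness, here is a sketch of how a program could exercise those proposed options directly. The PR_MPX_INIT and PR_MPX_RELEASE constants come from the posted patch set and, as noted above, may never reach mainline headers, so the sketch simply skips the calls when they are not defined:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
    #if defined(PR_MPX_INIT) && defined(PR_MPX_RELEASE)
        /* Turn bounds checking on for this process; normally the runtime
         * would do this before main() ever runs... */
        if (prctl(PR_MPX_INIT, 0, 0, 0, 0) != 0)
            perror("PR_MPX_INIT");

        /* ... application code runs with MPX enabled here ... */

        /* ...and tear the bounds tables down again when finished. */
        if (prctl(PR_MPX_RELEASE, 0, 0, 0, 0) != 0)
            perror("PR_MPX_RELEASE");
    #else
        fprintf(stderr, "PR_MPX_* not defined in these kernel headers\n");
    #endif
        return 0;
    }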
When a bounds violation is detected, the processor will trap into the kernel. The kernel, in turn, will turn the trap into a SIGSEGV signal to be delivered to the application, similar to other types of memory access errors. Applications that look at the siginfo structure passed to the signal handler from the kernel will be able to recognize a bounds error by checking the si_code field for the new SEGV_BNDERR value. The offending address will be stored in si_addr, while the bounds in effect at the time of the trap will be stored in si_lower and si_upper. But most programs, of course, will not handle SIGSEGV at all and will simply crash in this situation.
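A program that does want to catch bounds violations would use an ordinary SA_SIGINFO handler. This sketch assumes that the SEGV_BNDERR value and the si_lower/si_upper fields described above are present in the installed signal.h; it falls back to the default crash for any other kind of SIGSEGV:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Report MPX bounds violations; let other SIGSEGVs crash as usual. */
    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
    #ifdef SEGV_BNDERR
        if (info->si_code == SEGV_BNDERR) {
            fprintf(stderr, "bounds error: %p outside [%p, %p]\n",
                    info->si_addr, info->si_lower, info->si_upper);
            _exit(1);
        }
    #endif
        /* Not a bounds error: restore the default action and re-raise. */
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* ... MPX-instrumented application code would run here ... */
        return 0;
    }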
In summary, there is a fair amount of development work needed to make this hardware feature available to user applications. The good news is that, for the most part, this work appears to be done. Using MPX within the kernel itself should also be entirely possible, but no patches to that effect have been posted so far. Adding bounds checking to the kernel without breaking things is likely to present a number of interesting challenges; for example, narrowing would have to be reversed anytime the container_of() macro is used — and there are thousands of container_of() calls in the kernel. Finding ways to instrument the kernel would thus be tricky; doing this instrumentation in a way that does not make a mess out of the kernel source could be even harder. But there would be clear benefits should somebody manage to get the job done.
Meanwhile, though, anybody looking forward to MPX will have to wait for a couple of things: hardware that actually supports the feature and distributions built to use it. MPX is evidently a part of Intel's "Skylake" architecture, which is not expected to be commercially available before 2015 at the earliest. So there will be a bit of a wait before this feature is widely available. But, by the time it happens, Linux should be ready to take advantage of it.