Kernel development [LWN.net]

Kernel release status

The current development kernel is 3.15-rc1, released on April 13. "In comparison to those large releases, 3.15-rc1 is just big in general. No single big thing, but just lots and lots of commits. Sure, it has a few big new staging drivers (rtl8723au in particular), but even when big, those aren't nearly the bulk of things. There's just a lot going on." In the end, 12,034 non-merge changesets were pulled into the mainline repository during the 3.15 merge window.

Stable updates: 3.14.1, 3.13.10, 3.10.37, and 3.4.87 were released on April 14. The 3.2.57 update came out on April 10.


Quote of the week

Scenario: Your mission critical app is running (controlling a giant laser cutter). Oops there is a memory error, and the bad data arrives at the application causing it to swing the laser beam through 180 degrees, destroying half of your lab. A few seconds/minutes later - your EDAC driver prints a message saying that the uncorrected error count just got incremented.

— Tony Luck


3.15 Merge window, part 2

By Jonathan Corbet
April 16, 2014

By the time Linus released 3.15-rc1 and closed the 3.15 merge window on April 13, he had pulled 12,034 non-merge changesets into the mainline repository for this development cycle. So 3.15 does have the honor of being the busiest merge window ever, edging out 3.10, which had a mere 11,963 non-merge commits. All but the last 700 commits for 3.15 were covered in last week's merge window summary, so the list of new features added at the end of the merge window will be relatively small. Still, there are a few things worthy of note.

Faster resume

Arguably the most interesting change is a significant speedup in resume-from-suspend time on systems with SATA disk controllers. Over the years, various efforts have been made to parallelize the bootstrap and resume processes in order to reduce the wall-clock time needed to get to a working system. These attempts have often run into difficulties as the problem space proved to be more complex than originally understood. So full parallelism remains an elusive goal.

Recently, though, some developers realized that there was a piece of especially low-hanging fruit waiting to be picked: much of the time spent waiting for a system to resume goes into waiting for the ATA controllers to power up and get into a working state. Dan Williams put together a pair of patches (one to the ATA controller driver, one to the SCSI "sd" driver) to change their behavior a bit: rather than waiting for the controller to return to a working state, the drivers start the process and return immediately. That allows the rest of the kernel to continue working toward resuming the system while the controller powers up.

Of course, some of that work is likely to involve disk I/O. Any I/O requests that are submitted while the controllers are still waking up simply wait until they can be serviced. In the worst case, the system will block on I/O and fail to resume any faster than before, but, in practice, it is generally possible to get back to the window system without the need to wait for disk I/O. The results, as documented in this page describing the patches, are impressive. Resume time on a drive-heavy system dropped from 11.6 seconds to 1.1 seconds. On a couple of different single-drive systems, resume time went from over five seconds to less than one second. It is clearly a worthwhile improvement, especially since it requires little in the way of added complexity overall.

Elsewhere in the kernel

A set of patches to enable building the kernel with the LLVM compiler suite has been merged. That goal has not yet been reached; another set of patches is required and may show up in 3.16. Still, after some years of sporadic effort, it is getting closer.

In a change that has a small possibility of breaking user-space code, the x86 architecture will no longer allow the creation of 16-bit segments when running in 64-bit mode. The use of 16-bit segments can cause a kernel information leak on 64-bit systems, with potential security implications. Since running 16-bit code on these systems does not work all that well anyway and it's not clear that there are any users of it, this is probably a safe change to make. If users do exist, they might want to make their presence known during this development cycle so that their concerns can be addressed.

A handful of new drivers has been merged; these add support for Qualcomm SDHCI controllers, Armada 380 and 385 Marvell SoC-based SDHCI controllers, Energymicro efm32 i2c controllers, Qualcomm QUP-based I2C controllers, Cadence I2C controllers, Freescale enhanced direct memory access (eDMA) controllers, Renesas R-Car audio DMAC peripheral controllers, QCOM bus access manager (BAM) DMA controllers, Alienware AlienFX WMI-based platform features, and CPU frequency controllers on IBM POWERNV hardware.

In the 3.15-rc1 announcement, Linus let it be known that he is even less inclined than usual to add any more feature work outside of the merge window. Enough code has already found its way in to keep developers busy for the rest of the cycle, it seems. That work can be expected to be completed sometime right around the end of May if the usual pattern holds.


Avoiding memory-allocation deadlocks

April 16, 2014

This article was contributed by Neil Brown

There is a saying that you need to spend money to make money, though this apparent paradox is easily resolved with a start-up loan and the discipline of balancing expenses against income. A similar logic applies to the management of memory in an operating system kernel such as Linux: sometimes you need to allocate memory to free memory. Here, too, discipline is needed, though the typical consequence of not being sufficiently careful is not bankruptcy but rather a deadlock.

The history of how the Linux kernel developed its balance between saving and spending is interesting as a microcosm of how Linux development proceeds, and useful in understanding how to handle future deadlocks when they occur. A good place to start this history is in early 1998 with the introduction of __GFP_IO in Linux 2.1.80pre3.

`__GFP_IO` in 2.1.80pre3

Any memory allocation request in Linux includes a gfp_t argument, which is a set of flags to guide how the get_free_page() function can go about locating a free page. 2.1.80pre3 marks a change in this argument's type; it went from being a simple enumerated type to being a bitmask. The concepts embodied in each flag were present previously, but this is the first time that they could be explicitly identified.
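The difference between an enumerated value and a bitmask can be sketched in a few lines of user-space C. The names and bit values below are hypothetical, chosen only for illustration; they are not the definitions from 2.1.80pre3.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, named and numbered for illustration only. */
#define TOY_GFP_WAIT 0x01u	/* the caller is able to sleep */
#define TOY_GFP_IO   0x02u	/* the allocator may start I/O to free pages */
#define TOY_GFP_HIGH 0x04u	/* the request is urgent */

typedef unsigned int toy_gfp_t;

/* With a bitmask, each concept can be tested on its own... */
static bool toy_gfp_allows(toy_gfp_t flags, toy_gfp_t bit)
{
	assert(bit != 0);
	return (flags & bit) == bit;
}

/* ...and removed without disturbing the others. */
static toy_gfp_t toy_gfp_without(toy_gfp_t flags, toy_gfp_t bit)
{
	return flags & ~bit;
}
```

A caller that wants "everything the usual request allows, except I/O" can now say so directly with toy_gfp_without(), which an enumerated type could only express by adding yet another enumerator.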

__GFP_IO was one of the new flags. If it was set, then get_free_pages() was allowed to call shm_swap() to write some pages out to swap. If shm_swap() needed to allocate any buffer_head structures to complete the writeout, it would be careful not to set __GFP_IO. Without this protection, an infinite recursion could easily happen, which would quickly exhaust the stack and cause the kernel to crash.
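A minimal user-space model shows why leaving the flag out of the nested request bounds the recursion. Everything here (the toy_ names, the single free-page counter, the "four pages per writeout" figure) is an assumption made for the sketch and not the 2.1.80 code.

```c
#include <assert.h>

#define TOY_GFP_IO 0x1u	/* hypothetical: the allocator may write pages out */

static int toy_free_pages;	/* pages currently on the free list */
static int toy_depth;		/* current nesting of toy_get_free_page() */
static int toy_max_depth;	/* deepest nesting seen so far */

static int toy_get_free_page(unsigned int flags);

/* Stand-in for the swap-out path, which needs memory of its own. */
static void toy_swap_out(void)
{
	/*
	 * The nested request omits TOY_GFP_IO, so it cannot come back
	 * here. It may fail; this model simply carries on without it.
	 */
	(void)toy_get_free_page(0);
	toy_free_pages += 4;	/* pretend the writeout freed four pages */
}

/* Returns 1 if a page was handed out, 0 otherwise. */
static int toy_get_free_page(unsigned int flags)
{
	int ok = 0;

	toy_depth++;
	assert(toy_depth <= 2);	/* the recursion is bounded in this model */
	if (toy_depth > toy_max_depth)
		toy_max_depth = toy_depth;

	if (toy_free_pages == 0 && (flags & TOY_GFP_IO))
		toy_swap_out();
	if (toy_free_pages > 0) {
		toy_free_pages--;
		ok = 1;
	}

	toy_depth--;
	return ok;
}
```

Had toy_swap_out() passed TOY_GFP_IO to the nested call while the free list was empty, each level would have entered the swap-out path again and the nesting would never end.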

We have __GFP_IO in the kernel today, but, despite having the same name, it is a different flag. Having been introduced for 2.1.80, the original __GFP_IO was removed in 2.1.116, to be replaced with...

`PF_MEMALLOC` in 2.1.116

In the distant past (August 1998), we did not have change logs of nearly the quality that we have today, so an operating-system archaeologist is left to guess at the reasons for changes. All we can be really sure of is that the (per-request) __GFP_IO flag to get_free_page() disappeared, and a new per-process flag called PF_MEMALLOC appeared to take over the task of avoiding recursion. One clear benefit of this change is that it is more focused in addressing one particular issue: recursion is clearly a per-process issue and so a per-process flag is fitting. Previously, many memory allocation sites would avoid __GFP_IO when they didn't really need to, just in case. Now each call site doesn't need to worry about the problem of recursion; that concern is addressed precisely where it is needed.

The code comments here highlight an important aspect of memory allocation:

	 * The "PF_MEMALLOC" flag protects us against recursion:
	 * if we need more memory as part of a swap-out effort we
	 * will just silently return "success" to tell the page
	 * allocator to accept the allocation.

When possible, get_free_page() will just pluck a page off the free list and return it as quickly as it can. When that is not possible, it does not satisfy itself with freeing just one page, but will try to free quite a few, to save work next time. Thus, it is re-stocking that startup loan. A particular consequence of PF_MEMALLOC is that the memory allocator won't try too hard to get lots of pages; it will make do with what it has.

This means that processes with the PF_MEMALLOC flag set will have access to the last dregs of free memory, while other processes will need to go out and free up lots of memory first before they can use any. This property of PF_MEMALLOC is still present and somewhat more formal in the current kernel. The memory allocator has a concept of "watermarks" such that, if the amount of free memory is below the chosen watermark, the allocator will try to free more memory rather than return what it has. Different __GFP flags can select different watermark levels (min, low, high). PF_MEMALLOC causes all watermarks to be ignored; if any memory is available at all, it will be returned.
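The interaction between a per-process flag and a watermark can be modeled roughly as follows. This is a sketch under stated assumptions: one zone, one watermark, and hypothetical toy_ names throughout; the real allocator is far more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PF_MEMALLOC 0x1u	/* hypothetical per-task "I am reclaiming" flag */

struct toy_task { unsigned int flags; };

struct toy_zone {
	long free_pages;	/* pages on the free list */
	long watermark;		/* ordinary requests must leave this many */
	long reclaim_batch;	/* pages freed by one reclaim pass */
	int reclaim_runs;	/* how often reclaim was entered */
};

static void toy_reclaim(struct toy_zone *z, struct toy_task *t);

static bool toy_alloc_page(struct toy_zone *z, struct toy_task *t)
{
	bool reclaiming;

	assert(z != NULL && t != NULL);
	reclaiming = (t->flags & TOY_PF_MEMALLOC) != 0;

	/* Ordinary tasks must first go out and free memory. */
	if (!reclaiming && z->free_pages <= z->watermark)
		toy_reclaim(z, t);

	if (z->free_pages <= 0)
		return false;
	/* Only a reclaiming task may dip below the watermark. */
	if (!reclaiming && z->free_pages <= z->watermark)
		return false;

	z->free_pages--;
	return true;
}

static void toy_reclaim(struct toy_zone *z, struct toy_task *t)
{
	unsigned int saved = t->flags;

	t->flags |= TOY_PF_MEMALLOC;	/* nested requests will not reclaim */
	z->reclaim_runs++;
	/* Reclaim needs a working page; it is handed back with the batch. */
	if (toy_alloc_page(z, t))
		z->free_pages += z->reclaim_batch + 1;
	t->flags = saved;
}
```

With two free pages and a watermark of four, an ordinary task triggers exactly one reclaim pass; the pass itself is served from the last two pages because the flag is set while it runs.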

PF_MEMALLOC effectively says "It is time to stop saving and start spending, or we'll have no product to sell". In consequence of this, PF_MEMALLOC is now used more broadly than just for avoiding recursion (though it still has that role). Several kernel threads, such as those for nbd (the network block device), iscsi_tcp, and the MMC card controller, all set PF_MEMALLOC, presumably so they can be sure to get memory whenever they are called upon to write out a page of memory (so it can be freed).

In contrast, the MTD driver (which manages NAND flash and has a similar role to the MMC card driver) stopped using the PF_MEMALLOC flag in 2.6.33 with a comment suggesting it was an inappropriate usage. Whether the other uses in the kernel are still justified is a question too deep for our present discussion.

`__GFP_IO` in 2.2.0pre6

When __GFP_IO reappears, it has a purpose similar to that of the original, but for an importantly different reason. To understand that reason, it suffices to look at a comment in the code:

	/*
	 * Don't go down into the swap-out stuff if
	 * we cannot do I/O! Avoid recursing on FS
	 * locks etc.
	 */

The concern here still involves recursion, but it also involves locks, such as the per-inode mutex, the page lock, or various others. Calling into a filesystem to write out a page may require taking a lock. If any such lock is held when allocating memory then it is important to avoid calling into any filesystem code that might try to acquire the same lock. In those cases, the code must be careful not to pass __GFP_IO; in other cases, it is perfectly safe to include that flag.

So while PF_MEMALLOC avoids the specific recursion of get_free_page() calling into get_free_page(), __GFP_IO is more general and prevents any function holding a lock from calling, through get_free_page(), into any other function which might want that lock. The risk here isn't exhausting the stack as with PF_MEMALLOC; the risk is a deadlock.
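The lock-based variant of the problem can be sketched in user space as well. In the model below a boolean stands in for the lock, so that the would-be deadlock is reported instead of actually hanging; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GFP_IO 0x1u	/* hypothetical: reclaim may call into writeout */

struct toy_inode { bool locked; };

enum toy_result { TOY_OK, TOY_NO_MEMORY, TOY_DEADLOCK };

/* Writeout needs the inode lock for the duration of the write. */
static enum toy_result toy_writepage(struct toy_inode *ino)
{
	if (ino->locked)
		return TOY_DEADLOCK;	/* a real lock would block here forever */
	ino->locked = true;
	/* ... the page would be written here ... */
	ino->locked = false;
	return TOY_OK;
}

static enum toy_result toy_alloc(unsigned int flags, int *free_pages,
				 struct toy_inode *dirty)
{
	assert(free_pages != NULL && dirty != NULL);

	if (*free_pages == 0) {
		if (!(flags & TOY_GFP_IO))
			return TOY_NO_MEMORY;	/* fail rather than risk the lock */
		enum toy_result r = toy_writepage(dirty);
		if (r != TOY_OK)
			return r;
		*free_pages += 1;	/* the written page is now free */
	}
	*free_pages -= 1;
	return TOY_OK;
}
```

A caller that already holds the inode lock and asks with TOY_GFP_IO gets TOY_DEADLOCK in this model; the same caller asking without the flag merely sees the allocation fail, which it can handle.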

One might wonder why a GFP flag was used for this rather than a process flag, which would effectively say "I am holding a filesystem lock", given that the previous experience with __GFP_IO wasn't a success. Like many software designs, it probably just "seemed like a good idea at the time".

`__GFP_FS` in 2.4.5.8

This flag started life named __GFP_BUFFER in 2.4.5.1, but didn't really work properly until 2.4.5.8 when it was renamed to __GFP_FS. Apparently there was a thinko in the original design, which required not only a range of code changes, but also a new name.

__GFP_FS effectively split some functionality away from __GFP_IO so that where there was one flag, now there were two. Only three combinations of the two were expected: neither, both, or the new possibility of just the __GFP_IO flag being set. This would allow buffers that were already prepared to be written out, but would prohibit any calls into filesystems to prepare those buffers. I/O activity would be permitted, but filesystem activity would not.
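The three expected combinations can be written down as a small decision function. The flag values and the toy_reclaim_plan structure are assumptions for the sketch; they only restate the paragraph above in code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_GFP_IO 0x1u	/* hypothetical: already-prepared buffers may be written */
#define TOY_GFP_FS 0x2u	/* hypothetical: filesystem code may be called */

struct toy_reclaim_plan {
	bool write_prepared_buffers;
	bool call_into_filesystem;
};

/* Neither, both, or I/O alone; filesystem calls without I/O were not expected. */
static bool toy_combination_expected(unsigned int flags)
{
	return !((flags & TOY_GFP_FS) && !(flags & TOY_GFP_IO));
}

static struct toy_reclaim_plan toy_plan_reclaim(unsigned int flags)
{
	struct toy_reclaim_plan plan;

	assert(toy_combination_expected(flags));
	plan.write_prepared_buffers = (flags & TOY_GFP_IO) != 0;
	plan.call_into_filesystem = (flags & TOY_GFP_FS) != 0;
	return plan;
}
```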

Presumably, the fact that __GFP_IO previously had such a broad effect was harming performance, in that it had to be excluded in places where some I/O was still possible. Refining the rules by adding a new flag led to more flexibility, and so fewer impediments to performance.

`PF_FSTRANS` in 2.5.36

This new process flag appeared when XFS support was merged into Linux in late 2002. Its purpose was to indicate that a filesystem transaction (hence the name) was being prepared, meaning that any write to the filesystem would likely block until the transaction processing was complete. The effect of this flag was to exclude __GFP_FS from any memory allocation request which happened while PF_FSTRANS was set, or at least any request from within the XFS code. Other requests would not be affected, but then other code that allocated memory would be unlikely to be called while the flag was set.
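The effect described here amounts to filtering the request flags through the task's state before they reach the allocator. A minimal sketch, with hypothetical names and values, might look like this:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_GFP_IO     0x1u	/* hypothetical request flags */
#define TOY_GFP_FS     0x2u
#define TOY_PF_FSTRANS 0x1u	/* hypothetical per-task "in a transaction" flag */

struct toy_task { unsigned int flags; };

/* Flags a filesystem-internal allocation would really use for this task. */
static unsigned int toy_fs_alloc_flags(const struct toy_task *task,
				       unsigned int gfp)
{
	assert(task != NULL);
	if (task->flags & TOY_PF_FSTRANS)
		gfp &= ~TOY_GFP_FS;	/* no calls back into the filesystem */
	return gfp;
}
```

Only callers that route their requests through such a helper are affected, which matches the observation that requests from outside the XFS code would not see any change.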

Another way to see this flag is that, in the same way that the original __GFP_IO was converted to PF_MEMALLOC, now __GFP_FS is being converted to a process flag, too. In this case, the conversion is not complete, though.

Back in the halcyon days of 2.1.116, removing a flag like __GFP_IO was quite straightforward: there were few users and the implications of the change could be easily understood. In the more complex times of 2.5.36, such a step would be far from easy. Carefully adding new functionality is one thing; removing something that is entrenched is quite another, as we have seen with the Big Kernel Lock and the sleep_on() interface. Allowing either the new flag or the absence of the old to have the same effect is not a big cost and it was best to leave things that were working alone.

Skipping ahead of ourselves a little to 3.6-rc1, the PF_FSTRANS flag also gets used by NFS. Rather than setting it during a transaction, NFS sets it while constructing and transmitting an RPC request onto the network, so the name is now slightly less appropriate. Also, the effect of the flag on NFS is not exactly to clear __GFP_FS, but simply to avoid a call to transmit a COMMIT request inside nfs_release_page(), which is also avoided if __GFP_FS is missing. This usage is superficially different from that of XFS, but it has a generally similar effect for a generally similar reason. Modifying the flag to have a more global effect of clearing __GFP_FS and maybe renaming it to PF_MEMALLOC_NOFS might not be a bad idea.

`set_gfp_allowed_mask()` in 2.6.34

This function actually appeared in 2.6.31, but becomes more interesting in 2.6.34.

gfp_allowed_mask is a global variable holding a set of GFP flags which are allowed to be honored — all others are ignored. In particular, __GFP_FS, __GFP_IO, and __GFP_WAIT (which generally allows get_free_page() to wait for memory to be freed by other processes) are sometimes disabled via this mechanism. Thus it is a bit like PF_FSTRANS, except that it affects more processes and disables more flags.

gfp_allowed_mask came about while providing support for kmalloc() earlier in the boot process. During early boot, interrupts are still disabled and any attempt to allocate memory with __GFP_WAIT or certain other flags can trigger a warning from the lockdep checker. It would be surprising if memory were so tight during boot that the allocator actually needed to wait, but getting rid of warnings is generally a good thing, so gfp_allowed_mask was initialized to exclude the three flags mentioned, and these were added back in once the boot process was complete.

One thing we have learned over the years is that boot isn't as special as we sometimes think: whether it is suspend and resume, or hotplug hardware which teaches us this, it seems to be a lesson we keep finding new perspectives on. In that light, it is perhaps unsurprising that, in 2.6.34, the use of this mask was extended to cover suspend and resume (though an early version of the original patch did mention the importance of suspend).

In the case of memory allocation deadlocks, the suspend case is more significant than the boot case. During boot there is usually lots of free memory — not so during suspend, when we may well be short of memory. It wasn't warnings that prompted this change, but genuine deadlocks.

Suspend and resume are largely orderly processes, with devices being put to sleep in sequence, and then woken again in the reverse sequence. So it would not be enough just for block devices to avoid using __GFP_IO (which they already do). Rather, every driver must avoid __GFP_IO and the related flags, because the block device targeted by some write request might be sequenced relative to this driver such that it is already asleep and will not wake until this driver's device is completely awake.

Having a system-wide setting to disable these flags may be a bit excessive — just the process which is sequencing suspend might be sufficient — but it is certainly an easy fix and, as it cannot affect normal running of the system, it is thus a safe fix.
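A global mask of this kind can be modeled in a few lines. The sketch below uses hypothetical names, and it assumes that the suspend path drops only the I/O and filesystem bits; the text above does not say exactly which flags are cleared at that point.

```c
#include <assert.h>

#define TOY_GFP_WAIT 0x1u	/* hypothetical request flags */
#define TOY_GFP_IO   0x2u
#define TOY_GFP_FS   0x4u
#define TOY_GFP_HIGH 0x8u
#define TOY_GFP_ALL  (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS | TOY_GFP_HIGH)
#define TOY_GFP_BOOT (TOY_GFP_ALL & ~(TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS))

/* One mask for the whole system; it starts out in its early-boot state. */
static unsigned int toy_allowed_mask = TOY_GFP_BOOT;
static unsigned int toy_saved_mask;
static int toy_suspending;

/* Every request is filtered; disallowed flags are silently ignored. */
static unsigned int toy_filter(unsigned int gfp)
{
	return gfp & toy_allowed_mask;
}

static void toy_boot_done(void)
{
	toy_allowed_mask = TOY_GFP_ALL;
}

static void toy_suspend_begin(void)
{
	assert(!toy_suspending);	/* one saved mask, so no nesting */
	toy_suspending = 1;
	toy_saved_mask = toy_allowed_mask;
	toy_allowed_mask &= ~(TOY_GFP_IO | TOY_GFP_FS);	/* assumed set */
}

static void toy_resume_end(void)
{
	assert(toy_suspending);
	toy_allowed_mask = toy_saved_mask;
	toy_suspending = 0;
}
```

The single toy_saved_mask slot is what makes this approach awkward for anything that can overlap: a second toy_suspend_begin() before the first toy_resume_end() would overwrite the saved value.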

`PF_MEMALLOC_NOIO` in 3.9-rc1

Just as suspend/resume has taught us that boot-time is not that much of a special case, so too runtime power management has taught us that suspend isn't all that much of a special case either. If a block device is runtime-suspended to save power, then obviously it cannot handle requests to write out a dirty page of memory until it has woken up, and until any devices it depends on (a USB controller, a PCI bus) are awake too. So none of these devices can safely perform memory allocation using __GFP_IO.

In order to ensure this, we could use set_gfp_allowed_mask() while a device was suspending or resuming, but if multiple such devices were suspending or resuming we could easily lose track of when to restore the right mask. So this change introduces a process flag much like PF_FSTRANS, only to disable __GFP_IO rather than __GFP_FS. It also takes care to record the old value whenever the flag is set, and restore that old value when done. To know when to set this flag, a memalloc_noio flag is introduced for each device; it is then propagated into the parents in the device tree. PF_MEMALLOC_NOIO is set whenever calling into the power management code for any device with memalloc_noio set.
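The save-and-restore discipline and the propagation up the device tree can both be sketched briefly. The toy_ names below are hypothetical, and the model covers only the setting of the per-device flag, not its later clearing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GFP_IO           0x1u	/* hypothetical request flag */
#define TOY_PF_MEMALLOC_NOIO 0x1u	/* hypothetical per-task flag */

struct toy_task { unsigned int flags; };
struct toy_device { struct toy_device *parent; bool memalloc_noio; };

/* Set the task flag and hand back its previous state. */
static unsigned int toy_noio_save(struct toy_task *task)
{
	unsigned int old = task->flags & TOY_PF_MEMALLOC_NOIO;

	task->flags |= TOY_PF_MEMALLOC_NOIO;
	return old;
}

/* Put back whatever state the matching save found. */
static void toy_noio_restore(struct toy_task *task, unsigned int old)
{
	task->flags = (task->flags & ~TOY_PF_MEMALLOC_NOIO) | old;
}

/* Mark a device and every ancestor it depends on. */
static void toy_mark_noio(struct toy_device *dev)
{
	for (; dev != NULL; dev = dev->parent)
		dev->memalloc_noio = true;
}

static unsigned int toy_task_gfp(const struct toy_task *task, unsigned int gfp)
{
	if (task->flags & TOY_PF_MEMALLOC_NOIO)
		gfp &= ~TOY_GFP_IO;
	return gfp;
}

/* Flags an allocation made inside this device's PM callback would use. */
static unsigned int toy_pm_callback(struct toy_task *task,
				    const struct toy_device *dev,
				    unsigned int requested)
{
	unsigned int old, effective;

	assert(task != NULL && dev != NULL);
	if (!dev->memalloc_noio)
		return toy_task_gfp(task, requested);

	old = toy_noio_save(task);
	effective = toy_task_gfp(task, requested);
	toy_noio_restore(task, old);
	return effective;
}
```

Because each save returns the previous state, nested sections unwind correctly: restoring the inner section leaves the flag set, and only the outermost restore clears it. That is the property a single global mask cannot offer.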

As both the early boot processing and the suspend/resume processing are largely single-threaded (or have dedicated threads), it is quite possible that setting PF_MEMALLOC_NOIO and PF_FSTRANS on those threads would be a sufficient alternative to using set_gfp_allowed_mask(). However, as there is no clear benefit from such a change, and no clear certainty that it would work, it is safer, once again, to leave that which works alone.

Patterns that emerge

Amid all these details there are a couple of patterns which stand out.

The first is repeated refinement of the "avoid recursion" concept. At first it was implicit in an enumerated value passed to get_free_page(), then it was made explicit in the first __GFP_IO, and then the PF_MEMALLOC flag. Next it was extended to cover more subtle forms of recursion with the second version of __GFP_IO and, finally, that was split into two separate flags to express an even wider range of recursion scenarios that can be separately avoided.

It is conceivable that there is room for further refinement. We could have separate flags for different sorts of locks — one for page locks and one for inode locks, for example. There is no evidence that this would presently be useful, but Linux isn't really finished yet, so we just don't know.

The second pattern is the repeated discovery that just having a GFP flag often isn't enough — three times a new process flag was added because sometimes it isn't just a single allocation that needs to be controlled, but all allocations made by a given process. Is it only a matter of time before we get either a process flag which disables __GFP_WAIT or a per-process gfp_allowed_mask?

As a footnote to this pattern, it is amusing that in 3.6-rc1, as part of adding support for swap-over-NFS, a new flag, __GFP_MEMALLOC, was added which has much the same effect as PF_MEMALLOC in ignoring the normal low-watermarks and providing access to the last reserves of memory. This, together with the per-socket sk_allocation mask, allows certain TCP sockets (those which NFS is performing swap over) to access those last reserves to make sure that swap-out always succeeds. Clearly there is need for both GFP flags and process flags, as well as some per-device and per-socket flags.

We've not seen the last of this

While studying history can be generally enlightening, it can also be specifically useful as it is in this case. Next week, we will use this understanding of memory allocation deadlocks to explore some deadlocks which have long been possible in a certain configuration, but which now need to be removed.


Linus Torvalds Linux 3.15-rc1 out, merge window closed

Greg KH Linux 3.14.1

Sebastian Andrzej Siewior 3.14-rt1

Greg KH Linux 3.13.10

Greg KH Linux 3.10.37

Greg KH Linux 3.4.87

Ben Hutchings Linux 3.2.57

Matthias Brugger arm: Add basic support for Mediatek Cortex-A7 SoCs

Chanwoo Choi Support new Exynos3250 SoC based on Cortex-A7 dual core

Tarek Dakhran Exynos 5410 support

Thomas Petazzoni SMP support for Armada 375 and 38x

Anders Berg Add platform support for LSI AXM55xx

Alex Elder ARM: SMP: support Broadcom mobile SoCs

Steve Capper Huge pages for short descriptors on ARM

Markos Chandras MIPS: net: Add BPF JIT

Steven Rostedt rwsem: The return of multi-reader PI rwsems

Peter Zijlstra sched,idle: need resched polling rework

Matthew Wilcox Page I/O

Tejun Heo cgroup: implement unified hierarchy, v2

Tejun Heo cgroup: implement cgroup.populated, v2

David Herrmann File Sealing & memfd_create()

Rui Wang I/O Hook: Trace h/w access and emulate h/w events

Harini Katakam SPI: Add driver for Cadence SPI controller

Boris BREZILLON ARM: sunxi: add multi pin controller support

Andreas Noever Thunderbolt support for Apple MBP

rogerable@realtek.com Add modules for realtek USB card reader

Gabriel FERNANDEZ ARM: STi: Add Clock driver support STiH415 & STiH416

Gabriel FERNANDEZ Add ST Keyscan driver

Michael Welling tty serial: xr17c15x driver

Iyappan Subramanian net: Add APM X-Gene SoC Ethernet driver support

Krzysztof Kozlowski mfd: max14577: Add support for MAX77836

Stanimir Varbanov Add Qualcomm crypto driver

Vivien Didelot HID: (thingm) introduces blink(1) mk2

Srinivas Pandruvada Quaternion support

Mohit Kumar PCI: Add SPEAr13xx PCie support

Antti Palosaari [2013:025f] PCTV tripleStick (292e)

David Cohen Initial implementation of Intel MID watchdog driver

Xiubo Li Add Freescale FlexTimer Module timer

Jingoo Han Add support for Samsung GH7 PCIe controller

Lan Tianyu I2C ACPI operation region handler support

Michael Kerrisk Documenting prctl() PR_SET_THP_DISABLE and PR_GET_THP_DISABLE

Tejun Heo [PATCH cgroup/for-3.16] cgroup: add documentation about unified hierarchy

Tejun Heo blkcg: prepare blkcg knobs for default hierarchy

Liu Bo Online(inband) data deduplication

Eric W. Biederman No I/O from mntput

NeilBrown Support loop-back NFS mounts

Luiz Capitulino hugetlb: add support gigantic page allocation at runtime

riel@redhat.com sched,numa: reduce page migrations with pseudo-interleaving

John Stultz Volatile Ranges (v13)

Minchan Kim mm: support madvise(MADV_FREE)

Andrey Vagin tcp: allow to repair a tcp connections in closing states

Pablo Neira Ayuso new transaction infrastructure for nf_tables (v4)

Patrick McHardy : Release of nftables 0.2

Vivek Goyal net: Implement SO_PEERCGROUP and SO_PASSCGROUP socket options

Stephan Mueller SP800-90A Deterministic Random Bit Generator

Torsten Duwe : hwrng: an in-kernel rngd

Andy Lutomirski random: Use DRBG sources

Stephen Hemminger iproute2 3.14.0

Arturo Borrero Gonzalez nft event monitor

Pablo Neira Ayuso translations from iptables to nft

Andi Kleen perf, tools: Support spark lines in perf stat v3
