Kernel development
Brief items
Kernel release status
The current development kernel is 4.10-rc5, released on January 22. Linus noted that "everything looks nominal". He also changed the codename from the short-lived "Roaring Lionus" to "Anniversary Edition".
Stable updates: 4.9.5 and 4.4.44 were released on January 20. The 4.9.6 and 4.4.45 updates are in the review process as of this writing; they can be expected on or after January 26.
Vetter: Maintainers don't scale
Daniel Vetter has posted the text of his linux.conf.au talk on kernel maintenance. "At least for me, review isn’t just about ensuring good code quality, but also about diffusing knowledge and improving understanding. At first there’s maybe one person, the author (and that’s not a given), understanding the code. After good review there should be at least two people who fully understand it, including corner cases. And that’s also why I think that group maintainership is the only way to run any project with more than one regular contributor."
Kernel development news
The future of the page cache
The promise of large-scale persistent memory has forced a number of changes in the kernel and has raised questions about whether the kernel's page cache will be needed at all in the future. In his linux.conf.au 2017 talk, Matthew Wilcox asserted that not only do we still need the page cache, but that its role should be increased. First, though, there is the small matter of correcting a mistake made by a certain Mr. Wilcox a couple of years ago.

This was, he started, his first talk ever as a Microsoft employee — something he thought he would never find himself saying. He then launched into his topic by saying that computing is all about caching. His new laptop can execute 10 billion instructions per second, but only as long as it doesn't take a cache miss. Memory on that system can only deliver 530 million cache lines per second, so it doesn't take many cache misses to severely impact its performance. Things get even worse if the data you want isn't cached in main memory and has to be read from a storage device, even a fast solid-state device.
It has always been that way; a PDP-11 was also significantly slowed by cache misses. But the problem is getting worse. CPU speeds have increased more than memory speeds, which, in turn, have increased more than storage speeds. The cost of not caching your data properly is thus going up.
The page cache
Unix systems have had a buffer cache, which sits between the filesystem and the disk for the purpose of caching disk blocks in memory, for a long time. While preparing the talk, he went back to look at Sixth-edition Unix (released in 1975) and found a buffer cache there. Linux has had a buffer cache since the beginning. In the 1.3.50 release in 1995, Linus Torvalds added a significant innovation in the form of the page cache. This cache differs from the buffer cache in that it sits between the virtual filesystem (VFS) layer and the filesystem itself. With the page cache, there is no need to call into filesystem code at all if the desired page is present already. Initially, the page and buffer caches were entirely separate, but Ingo Molnar unified them in 1999. Now, the buffer cache still exists, but its entries point into the page cache.
The page cache has a great deal of functionality built into it. There are some obvious functions, like finding a page at a given index; if the page doesn't exist, it can be created and optionally filled from disk. Dirty pages can be pushed back to disk. Pages can be locked, unlocked, and removed from the cache. Threads can wait for changes in a page's state, and there are interfaces to search for pages in a given state. The page cache is also able to keep track of errors associated with persistent storage.
Locking for the page cache is handled internally. There tends to be disagreement in the kernel community over the level at which locking should be handled; in this case it has been settled in favor of internal locking. There is a spinlock to control access when changes are being made to the page cache, but lookups are handled using the lockless read-copy-update (RCU) mechanism.
Caching is the art of predicting the future, he said. When the cache grows too large, various heuristics come into play to decide which pages should be removed. Pages used only once are likely to not be used again, so those are kept in the "inactive" list and pushed out relatively quickly. A second use will promote a page from the inactive list to the active list. Unused pages eventually age off the active list and are put back onto the inactive list. Exceptional "shadow" entries are used to track pages that have fallen off the end of the inactive list and have been reclaimed; these entries have the effect of lengthening the kernel's memory about pages that were used in the relatively distant past.
Huge pages have been a challenge for the page cache for a while. The kernel's transparent huge page feature initially only worked with anonymous (non-file-backed) memory. There are good reasons for using huge pages in the page cache, though. Initial work in this area simply adds a large set of single-page entries to the page cache to correspond to a single huge page. Wilcox concluded that this approach was "silly"; he enhanced the radix tree code, used to track pages in the page cache, to be able to handle huge-page entries directly. Pending patches will cause the page cache to use a single entry for huge pages.
Do we still need the page cache?
Recently, Dave Chinner asserted that there was no longer a need for a page cache. He noted that the DAX subsystem, initially created by Wilcox to provide direct access to file data stored in persistent memory, bypasses the page cache entirely. "There is nothing like having your colleagues question your entire motivation", Wilcox said. There are people who disagree with Chinner, though, including Torvalds, who popped up in a separate forum saying that the page cache is important because good things don't come from having low-level filesystem code in the critical path for data access.
With that last statement in mind, Wilcox delved into how an I/O request using DAX works now. He designed the original DAX code and, in so doing, concluded that there was no need to use the page cache. That decision, he said, was wrong.
In current kernels, when an application makes a system call like read() to read some data from a file stored in persistent memory, DAX gets involved. Since the requested data is not present in the page cache, the VFS layer calls the filesystem-specific read_iter() function. That, in turn, calls into the DAX code, which will call back into the filesystem to turn the file offset into a block number. Then the block layer is queried to get the location of that block in persistent memory (mapping it into the kernel's address space if need be) so that the block's contents can be copied back to the application.
That is "not awful", but it should work differently, he said. The initial steps would be the same, in that the read_iter() function would still be called, and it would call into the DAX code. But, rather than calling back into the filesystem, DAX should call into the page cache to get the physical address associated with the desired offset in the file. The data is then copied back to user space from that address. This all assumes that the information is already present in the page cache but, when that is the case, the low-level filesystem code need not get involved at all. The filesystem had already done the work, and the page cache had cached the result.
When Torvalds wrote the above-mentioned post about the page cache, he said:
This, Wilcox said, was "so right"; the locking in DAX has indeed been disastrous. He originally thought it would be possible to get away with relatively simple locking, but complexity crept in with each new edge case that was discovered. DAX locking is now "really ugly" and he is sorry that he made the mistake of thinking that he could bypass the page cache. Now, he said, he has to fix it.
Future work
He concluded with a number of enhancements he would like to see made around DAX and the page cache. The improved huge-page support mentioned above is one of them; that is already sitting in the -mm tree and should be done soon. The use of page-frame numbers instead of page structures has been under discussion for a while since there is little desire to make the kernel store vast numbers of page structures for large persistent memory arrays.
He would like to revisit the idea of filesystems with a block size larger than the system's page size. That is something that people have wanted for many years; now that the page cache can handle more than one page size, it should be possible. "A simple matter of coding", he said. He is looking for other interested developers to work with on this project.
Huge swap entries are also an area of interest. We have huge anonymous pages in memory but, when it comes time to swap them out, they get broken up into regular pages. "That is probably the wrong answer". There is work in improving swap performance, but it needs to be reoriented toward keeping huge pages together. That might help with the associated idea of swapping to persistent memory. Data in a persistent-memory swap space can still be accessed, so it may well make sense to just leave it there, especially if it is not being heavily modified.
The video of this talk, including a bonus section on page-cache locking, is available.
[Your editor would like to thank linux.conf.au and the Linux Foundation for assisting with his travel to the event.]
A pair of GCC plugins
Over the last year or more, multiple hardening features have made their way from the grsecurity/PaX kernels into the mainline under the auspices of the Kernel Self Protection Project. One that was added for the 4.8 kernel is the GCC plugin infrastructure that allows processing kernel code during the build to inject various kinds of protections. Several plugins have been merged, most notably the latent_entropy plugin for 4.9. Two other plugins have recently been proposed: kernexec for preventing the kernel from executing user-space code and structleak to clear structure fields that might be copied to user space.
kernexec
If the kernel is tricked into executing user-space memory, that can be used by attackers to subvert the system. An attacker can run the code of their choice with the kernel's privileges. So the ability to prevent that is an important hardening feature that is implemented in hardware as Supervisor Mode Execution Protection (SMEP) on some Intel CPUs and as Privileged Access Never (PAN) on some ARM systems.
For those x86_64 systems that lack SMEP, though, kernexec can provide much the same protection. In mid-January, Kees Cook posted an initial version of the kernexec plugin. The plugin changes the kernel so that, at run time, addresses used to make C function calls always have the high bit set. All kernel functions reside in the kernel address space, which has the high bit set. Since the Linux kernel will never map user-space memory at addresses with the high bit set, attempts to run user-space code by overwriting addresses to point into user space will fail. Instead of executing code at the address arranged for by the attacker, the plugin arranges to trigger a general protection fault. Similarly, return addresses are forced at run time to have the high bit set before the return instruction is executed.
The performance impact of kernel hardening efforts is always a concern, so the plugin attempts to optimize the call and return instructions. If a register is available, the call site simply loads the address into the register and does a logical OR with 0x8000000000000000. For the return, it uses a bit-set instruction (btsq) to set the high bit of the return address on the stack.
Cook notes that there is "significant coverage missing" with this version of the plugin. It is missing the assembly-language pieces, which means that assembly code can still make calls into or return to user-space addresses. That infrastructure still needs to be ported over from PaX, he said.
structleak
Kernel structures (or fields contained within them) are often copied to user space. If those structures are not initialized, though, they can contain "interesting" values that have lingered in the kernel's memory. If an attacker can arrange for those values to line up with the structure and get them copied to user space, the result is a kernel information leak. CVE-2013-2141 was a leak of that sort; it led "PaX Team" (who develops the PaX patch set) to create the structleak plugin.
Cook also posted a port of that plugin to the kernel mailing list on January 13. It looks for the __user attribute (an annotation used to indicate user-space pointers) on fields in structures declared as variables local to a function. If those variables are not initialized (and thus would still contain "garbage" from the stack), the plugin zeroes them out. In that way, if those values get copied to user space at some point, there will be no exposure of kernel memory contents.
PaX Team commented on the patch posting, mostly suggesting tweaks to some of the text accompanying the plugin. In particular, Cook had changed the description of the plugin in the Kconfig description from what is in PaX. However, Cook had reasonable justifications for most of those changes.
In addition, the wording of a Kconfig option that turns on verbose mode for structleak (GCC_PLUGIN_STRUCTLEAK_VERBOSE) did not meet with PaX Team's approval. That text notes that false positives can be reported since "not all existing initializers are detected by the plugin", but PaX Team objected to that characterization: "a variable either has a constructor or it does not ;)". But Cook looks at things a bit differently:
Beyond wording issues, though, as Mark Rutland pointed out, the __user annotation is not a true indication that there is a problem:
He suggested that analyzing calls to copy_to_user() and friends might allow better detection. PaX Team agreed, but said that the original idea was to find a simple pattern to match to eliminate CVE-2013-2141 and other, similar bugs. Now that the bug is fixed, it is unclear if the plugin is actually blocking any problems, but there is little reason not to keep it, PaX Team said: "i keep this plugin around because it costs nothing to maintain it and the alternative (better) solution doesn't exist yet."
These are both fairly straightforward hardening features that may prevent kernel bugs from being (ab)used by attackers. Structleak may not truly be needed at this point, but new code could introduce a similar problem and the plugin is not particularly intrusive. Kernexec, on the other hand, has the potential to stop attacks that rely on the kernel executing user-space code in their tracks. While both plugins have existed out of tree for some time, getting them upstream so that distributors can start building their kernels that way, thus getting them into the hands of more Linux users, can only be a good thing. Hopefully we will see some of the others make their way into the mainline before too long as well.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
