Kernel development
Brief items
Kernel release status
The current development kernel is 4.10-rc5, released on January 22. Linus noted that "everything looks nominal". He also changed the codename from the short-lived "Roaring Lionus" to "Anniversary Edition".
Stable updates: 4.9.5 and 4.4.44 were released on January 20. The 4.9.6 and 4.4.45 updates are in the review process as of this writing; they can be expected on or after January 26.
Vetter: Maintainers don't scale
Daniel Vetter has posted the text of his linux.conf.au talk on kernel maintenance. "At least for me, review isn’t just about ensuring good code quality, but also about diffusing knowledge and improving understanding. At first there’s maybe one person, the author (and that’s not a given), understanding the code. After good review there should be at least two people who fully understand it, including corner cases. And that’s also why I think that group maintainership is the only way to run any project with more than one regular contributor."
Kernel development news
The future of the page cache
The promise of large-scale persistent memory has forced a number of changes in the kernel and has raised questions about whether the kernel's page cache will be needed at all in the future. In his linux.conf.au 2017 talk, Matthew Wilcox asserted that not only do we still need the page cache, but that its role should be increased. First, though, there is the small matter of correcting a mistake made by a certain Mr. Wilcox a couple of years ago.

This was, he started, his first talk ever as a Microsoft employee — something he thought he would never find himself saying. He then launched into his topic by saying that computing is all about caching. His new laptop can execute 10 billion instructions per second, but only as long as it doesn't take a cache miss. Memory on that system can only deliver 530 million cache lines per second, so it doesn't take many cache misses to severely impact its performance. Things get even worse if the data you want isn't cached in main memory and has to be read from a storage device, even a fast solid-state device.
It has always been that way; a PDP-11 was also significantly slowed by cache misses. But the problem is getting worse. CPU speeds have increased more than memory speeds, which, in turn, have increased more than storage speeds. The cost of not caching your data properly is thus going up.
The page cache
Unix systems have had a buffer cache, which sits between the filesystem and the disk for the purpose of caching disk blocks in memory, for a long time. While preparing the talk, he went back to look at Sixth-edition Unix (released in 1975) and found a buffer cache there. Linux has had a buffer cache since the beginning. In the 1.3.50 release in 1995, Linus Torvalds added a significant innovation in the form of the page cache. This cache differs from the buffer cache in that it sits between the virtual filesystem (VFS) layer and the filesystem itself. With the page cache, there is no need to call into filesystem code at all if the desired page is present already. Initially, the page and buffer caches were entirely separate, but Ingo Molnar unified them in 1999. Now, the buffer cache still exists, but its entries point into the page cache.
The page cache has a great deal of functionality built into it. There are some obvious functions, like finding a page at a given index; if the page doesn't exist, it can be created and optionally filled from disk. Dirty pages can be pushed back to disk. Pages can be locked, unlocked, and removed from the cache. Threads can wait for changes in a page's state, and there are interfaces to search for pages in a given state. The page cache is also able to keep track of errors associated with persistent storage.
Locking for the page cache is handled internally. There tends to be disagreement in the kernel community over the level at which locking should be handled; in this case it has been settled in favor of internal locking. There is a spinlock to control access when changes are being made to the page cache, but lookups are handled using the lockless read-copy-update (RCU) mechanism.
Caching is the art of predicting the future, he said. When the cache grows too large, various heuristics come into play to decide which pages should be removed. Pages used only once are likely to not be used again, so those are kept in the "inactive" list and pushed out relatively quickly. A second use will promote a page from the inactive list to the active list. Unused pages eventually age off the active list and are put back onto the inactive list. Exceptional "shadow" entries are used to track pages that have fallen off the end of the inactive list and have been reclaimed; these entries have the effect of lengthening the kernel's memory about pages that were used in the relatively distant past.
Huge pages have been a challenge for the page cache for a while. The kernel's transparent huge page feature initially only worked with anonymous (non-file-backed) memory. There are good reasons for using huge pages in the page cache, though. Initial work in this area simply adds a large set of single-page entries to the page cache to correspond to a single huge page. Wilcox concluded that this approach was "silly"; he enhanced the radix tree code, used to track pages in the page cache, to be able to handle huge-page entries directly. Pending patches will cause the page cache to use a single entry for huge pages.
Do we still need the page cache?
Recently, Dave Chinner asserted that there was no longer a need for a page cache. He noted that the DAX subsystem, initially created by Wilcox to provide direct access to file data stored in persistent memory, bypasses the page cache entirely. "There is nothing like having your colleagues question your entire motivation", Wilcox said. There are people who disagree with Chinner, though, including Torvalds, who popped up in a separate forum saying that the page cache is important because good things don't come from having low-level filesystem code in the critical path for data access.
With that last statement in mind, Wilcox delved into how an I/O request using DAX works now. He designed the original DAX code and, in so doing, concluded that there was no need to use the page cache. That decision, he said, was wrong.
In current kernels, when an application makes a system call like read() to read some data from a file stored in persistent memory, DAX gets involved. Since the requested data is not present in the page cache, the VFS layer calls the filesystem-specific read_iter() function. That, in turn, calls into the DAX code, which will call back into the filesystem to turn the file offset into a block number. Then the block layer is queried to get the location of that block in persistent memory (mapping it into the kernel's address space if need be) so that the block's contents can be copied back to the application.
That is "not awful", but it should work differently, he said. The initial steps would be the same, in that the read_iter() function would still be called, and it would call into the DAX code. But, rather than calling back into the filesystem, DAX should call into the page cache to get the physical address associated with the desired offset in the file. The data is then copied back to user space from that address. This all assumes that the information is already present in the page cache but, when that is the case, the low-level filesystem code need not get involved at all. The filesystem had already done the work, and the page cache had cached the result.
When Torvalds wrote the above-mentioned post about the page cache, he said:
This, Wilcox said, was "so right"; the locking in DAX has indeed been disastrous. He originally thought it would be possible to get away with relatively simple locking, but complexity crept in with each new edge case that was discovered. DAX locking is now "really ugly" and he is sorry that he made the mistake of thinking that he could bypass the page cache. Now, he said, he has to fix it.
Future work
He concluded with a number of enhancements he would like to see made around DAX and the page cache. The improved huge-page support mentioned above is one of them; that is already sitting in the -mm tree and should be done soon. The use of page-frame numbers instead of page structures has been under discussion for a while since there is little desire to make the kernel store vast numbers of page structures for large persistent memory arrays.
He would like to revisit the idea of filesystems with a block size larger than the system's page size. That is something that people have wanted for many years; now that the page cache can handle more than one page size, it should be possible. "A simple matter of coding", he said. He is looking for other interested developers to work with on this project.
Huge swap entries are also an area of interest. We have huge anonymous pages in memory but, when it comes time to swap them out, they get broken up into regular pages. "That is probably the wrong answer". There is work in improving swap performance, but it needs to be reoriented toward keeping huge pages together. That might help with the associated idea of swapping to persistent memory. Data in a persistent-memory swap space can still be accessed, so it may well make sense to just leave it there, especially if it is not being heavily modified.
The video of this talk, including a bonus section on page-cache locking, is available.
[Your editor would like to thank linux.conf.au and the Linux Foundation for assisting with his travel to the event.]
A pair of GCC plugins
Over the last year or more, multiple hardening features have made their way from the grsecurity/PaX kernels into the mainline under the auspices of the Kernel Self Protection Project. One that was added for the 4.8 kernel is the GCC plugin infrastructure that allows processing kernel code during the build to inject various kinds of protections. Several plugins have been merged, most notably the latent_entropy plugin for 4.9. Two other plugins have recently been proposed: kernexec for preventing the kernel from executing user-space code and structleak to clear structure fields that might be copied to user space.
kernexec
If the kernel is tricked into executing user-space memory, that can be used by attackers to subvert the system. An attacker can run the code of their choice with the kernel's privileges. So the ability to prevent that is an important hardening feature that is implemented in hardware as Supervisor Mode Execution Protection (SMEP) on some Intel CPUs and as Privileged Access Never (PAN) on some ARM systems.
For those x86_64 systems that lack SMEP, though, kernexec can provide much the same protection. In mid-January, Kees Cook posted an initial version of the kernexec plugin. The plugin changes the kernel so that, at run time, addresses used to make C function calls always have the high bit set. All kernel functions reside in the kernel address space, which has the high bit set. Since the Linux kernel will never map user-space memory at addresses with the high bit set, attempts to run user-space code by overwriting addresses to point into user space will fail. Instead of executing code at the address arranged for by the attacker, the plugin arranges to trigger a general protection fault. Similarly, return addresses are forced at run time to have the high bit set before the return instruction is executed.
The performance impact of kernel hardening efforts is always a concern, so the plugin attempts to optimize the call and return instructions. If a register is available, the call site simply loads the address into the register and does a logical OR with 0x8000000000000000. For the return, it uses a bit-set instruction (btsq) to set the high bit of the return address on the stack.
Cook notes that there is "significant coverage missing" with this version of the plugin. It is missing the assembly-language pieces, which means that assembly code can still make calls into or return to user-space addresses. That infrastructure still needs to be ported over from PaX, he said.
structleak
Kernel structures (or fields contained within them) are often copied to user space. If those structures are not initialized, though, they can contain "interesting" values that have lingered in the kernel's memory. If an attacker can arrange for those values to line up with the structure and get them copied to user space, the result is a kernel information leak. CVE-2013-2141 was a leak of that sort; it led "PaX Team" (who develops the PaX patch set) to create the structleak plugin.
Cook also posted a port of that plugin to the kernel mailing list on January 13. It looks for the __user attribute (an annotation used to indicate user-space pointers) on fields in structures declared as variables local to a function. If those variables are not initialized (and thus would still contain "garbage" from the stack), the plugin zeroes them out. In that way, if those values get copied to user space at some point, there will be no exposure of kernel memory contents.
PaX Team commented on the patch posting, mostly suggesting tweaks to some of the text accompanying the plugin. In particular, Cook had changed the description of the plugin in the Kconfig description from what is in PaX. However, Cook had reasonable justifications for most of those changes.
In addition, the wording of a Kconfig option that turns on verbose mode for structleak (GCC_PLUGIN_STRUCTLEAK_VERBOSE) did not meet with PaX Team's approval. That text notes that false positives can be reported since "not all existing initializers are detected by the plugin", but PaX Team objected to that characterization: "a variable either has a constructor or it does not ;)". But Cook looks at things a bit differently:
Beyond wording issues, though, as Mark Rutland pointed out, the __user annotation is not a true indication that there is a problem:
He suggested that analyzing calls to copy_to_user() and friends might allow better detection. PaX Team agreed, but said that the original idea was to find a simple pattern to match to eliminate CVE-2013-2141 and other, similar bugs. Now that the bug is fixed, it is unclear if the plugin is actually blocking any problems, but there is little reason not to keep it, PaX Team said: "i keep this plugin around because it costs nothing to maintain it and the alternative (better) solution doesn't exist yet."
These are both fairly straightforward hardening features that may prevent kernel bugs from being (ab)used by attackers. Structleak may not truly be needed at this point, but new code could introduce a similar problem and the plugin is not particularly intrusive. Kernexec, on the other hand, has the potential to stop attacks that rely on the kernel executing user-space code in their tracks. While both plugins have existed out of tree for some time, getting them upstream so that distributors can start building their kernels that way, thus getting them into the hands of more Linux users, can only be a good thing. Hopefully we will see some of the others make their way into the mainline before too long as well.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
