Kernel development
Brief items
Kernel release status
The current development kernel is 3.14-rc5, released on March 2. Linus says: "Not a lot. Which is just how I like it. Go verify that it all works for you."
Stable updates: no stable updates have been released in the last week. The 3.13.6 and 3.10.33 updates are in the review process as of this writing; they can be expected on or after March 6.
Quotes of the week
Red Hat's dynamic kernel patching project
It seems that Red Hat, too, has a project working on patching running kernels. "kpatch allows you to patch a Linux kernel without rebooting or restarting any processes. This enables sysadmins to apply critical security patches to the kernel immediately, without having to wait for long-running tasks to complete, users to log off, or scheduled reboot windows. It gives more control over uptime without sacrificing security or stability." It looks closer to ksplice than to SUSE's kGraft in that it patches out entire functions at a time.
SUSE Labs Director Talks Live Kernel Patching with kGraft (Linux.com)
Libby Clark talks with Vojtech Pavlik, Director of SUSE Labs, about kGraft. "In this Q&A, Pavlik goes into more detail on SUSE's live kernel patching project; how the kGraft patch integrates with the Linux kernel; how it compares with other live-patching solutions; how developers will be able to use the upcoming release; and the project's interaction with the kernel community for upstream acceptance."
Broadcom releases SoC graphics driver source
Broadcom has announced the release of the source and documentation for its VideoCore IV graphics subsystem. This subsystem is found in the Raspberry Pi processor, among others. "The trend over the last decade has leaned towards greater openness in desktop graphics, and the same is happening in the mobile space. Broadcom — a long-time leader in graphics processors — is a frontrunner in this movement and aims to contribute to its momentum."
Kernel development news
Finding the proper scope of a file collapse operation
System call design is never easy; there are often surprising edge cases that developers fail to consider as they settle on an interface. System calls involving filesystems seem to be especially prone to this kind of problem, since the complexity and variety of filesystem implementations means that there may be any number of surprises waiting for a developer who wants to create a new file-oriented operation. Some of these surprises can be seen in the discussion of a proposed addition to the fallocate() system call.
fallocate() is concerned with the allocation of space within a file; its initial purpose was to allow an application to allocate blocks to a file prior to writing them. This type of preallocation ensures that the needed space is available before trying to write the data that goes there; it can also help filesystem implementations lay out the allocated space more efficiently on disk. Later on, the FALLOC_FL_PUNCH_HOLE operation was added to deallocate blocks within a file, leaving a hole in the file.
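By way of illustration (this example is ours, not part of the discussion), a minimal and minimally error-checked use of the existing operations might look like this; the file name and sizes are arbitrary:

    /* Illustration only: preallocate blocks, then punch a hole in the middle.
     * FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            exit(1);

        /* Ensure blocks are allocated for the first 16MB before writing. */
        if (fallocate(fd, 0, 0, 16 << 20))
            exit(1);

        /* Deallocate 1MB starting at the 4MB mark, leaving a hole. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      4 << 20, 1 << 20))
            exit(1);

        close(fd);
        return 0;
    }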
In February, Namjae Jeon proposed a new fallocate() operation called FALLOC_FL_COLLAPSE_RANGE; this proposal included implementations for the ext4 and xfs filesystems. Like the hole-punching operation, it removes data from a file, but there is a difference: rather than leaving a hole in the file, this operation moves all data beyond the affected range to the beginning of that range, shortening the file as a whole. The immediate user for this operation would appear to be video editing applications, which could use it to quickly and efficiently remove a segment of a video file. If the removed range is block-aligned (which would be a requirement, at least for some filesystems), the removal could be effected by changing the file's extent maps, with no actual copying of data required. Given that files containing video data can be large, it is not hard to understand why an efficient "cut" operation would be attractive.
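Should the operation go in as proposed, a video editor's "cut" might look something like the following sketch. It is hypothetical: cut_segment() is a made-up helper, and the flag's numeric value is taken from the proposed patches and could change along with the rest of the interface.

    /* Hypothetical sketch of a "cut" using the proposed
     * FALLOC_FL_COLLAPSE_RANGE; offset and len must be block-aligned. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    #ifndef FALLOC_FL_COLLAPSE_RANGE
    #define FALLOC_FL_COLLAPSE_RANGE 0x08   /* value per the proposed patches */
    #endif

    int cut_segment(int fd, off_t offset, off_t len)
    {
        /* Everything beyond offset+len slides down to offset and the file
         * shrinks by len; if the range is block-aligned, no data need be
         * copied, only the extent maps changed. */
        if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)) {
            perror("FALLOC_FL_COLLAPSE_RANGE");
            return -1;
        }
        return 0;
    }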
So what kinds of questions arise with an operation like this? One could start with the interaction with the mmap() system call, which maps a file into a process's address space. The proposed implementation works by removing all pages from the affected range to the end of the file from the page cache; dirty pages are written back to disk first. That will prevent the immediate loss of data that may have been written via a mapping, and will get rid of any memory pages that will be after the end of the file once the operation is complete. But it could be a surprise for a process that does not expect the contents of a file to shift around underneath its mapping. That is not expected to be a huge problem; as Dave Chinner pointed out, the types of applications that would use the collapse operation do not generally access their files via mmap(). Beyond that, applications that are surprised by a collapsed file may well be unable to deal with other modifications even in the absence of a collapse operation.
But, as Hugh Dickins noted, there is a related problem: in the tmpfs filesystem, all files live in the page cache and look a lot like a memory mapping. Since the page cache is the backing store, removing file pages from the page cache is unlikely to lead to a happy ending. So, before tmpfs could support the collapse operation, a lot more effort would have to go into making things play well with the page cache. Hugh was not sure that there would ever be a need for this operation in tmpfs, but, he said, solving the page cache issues for tmpfs would likely lead to a more robust implementation for other filesystems as well.
Hugh also wondered whether the uni-directional collapse operation should, instead, be designed to work in both directions:
Andrew Morton went a little further, suggesting that a simple "move these blocks from here to there" system call might be the best idea. But Dave took a dim view of that suggestion, worrying that it would introduce a great deal of complexity and difficult corner cases:
Andrew disagreed, claiming that a more general interface was preferable and that the problems could be overcome, but nobody else supported him on this point. So, chances are, the operation will remain confined to collapsing chunks out of files; a separate "insert" operation may be added in the future, should an interesting use case for it be found.
Meanwhile, there is one other behavioral question to answer; what happens if the region to be removed from the file reaches to the end of the file? The current patch set returns EINVAL in that situation, with the idea that a call to truncate() should be used instead. Ted Ts'o asked whether such operations should just be turned directly into truncate() calls, but Dave is set against that idea. A collapse operation that includes the end of the file, he said, is almost certainly buggy; it is better to return an error in that case.
There are also, evidently, some interesting security issues that could come up if a collapse operation were allowed to include the end of the file. Filesystems can allocate blocks beyond the end of the file; indeed, fallocate() can be used to explicitly request that behavior. Those blocks are typically not zeroed out by the filesystem; instead, they are kept inaccessible so that whatever stale data is contained there cannot be read. Without a great deal of care, a collapse implementation that allowed the range to go beyond the end of the file could end up exposing that data, especially if the operation were to be interrupted (by a system crash, perhaps) in the middle. Rather than set that sort of trap for filesystem developers, Dave would prefer to disallow the risky operations from the beginning, especially since there does not appear to be any real need to support them.
So the end result of all this discussion is that the FALLOC_FL_COLLAPSE_RANGE operation is likely to go into the kernel essentially unchanged. It will not have all the capabilities that some developers would have liked to see, but it will support one useful feature that should help to accelerate a useful class of applications. Whether this will be enough for the long term remains to be seen; system call API design is hard. But, should additional features be needed in the future, new FALLOC_FL commands can be created to make them available in a compatible way.
Tracing unsigned modules
The reuse of one of the "tainted kernel" flags by the signed-module loading code has led to a problem using tracepoints in unsigned kernel modules. The problem is fairly easily fixed, but there was opposition to doing so, at least until a "valid" use case could be found. Kernel hackers are not particularly interested in helping out-of-tree modules, and fixing the problem was seen that way—at first, anyway.
Loadable kernel modules have been part of the kernel landscape for nearly 20 years (kernel 1.2 in 1995), but have only recently gained the ability to be verified by a cryptographic signature, so that only "approved" modules can be loaded. Red Hat kernels have had the feature for some time, though it was implemented differently than what eventually ended up in the kernel. Basically, the kernel builder can specify a key to be used to sign modules; the private key gets stored (or discarded after signing), while the public key is built into the kernel.
There are several kernel configuration parameters that govern module signing: CONFIG_MODULE_SIG controls whether the code to do signature checking is enabled at all, while CONFIG_MODULE_SIG_FORCE determines whether all modules must be signed. If CONFIG_MODULE_SIG_FORCE is not turned on (and the corresponding kernel boot parameter module.sig_enforce is not present), then modules without signatures or those using keys not available to the kernel will still be loaded. In that case, though, the kernel will be marked as tainted.
The taint flag used, though, is the same as that used when the user does a modprobe --force to force the loading of a module built for a different kernel version: TAINT_FORCED_MODULE. Force-loading a module is fairly dangerous; incompatibilities between the module's view of memory layout and the kernel's can lead to crashes that the kernel developers are not interested in spending time on. So, force-loading a module taints the kernel to allow those bug reports to be quickly skipped over.
But loading an unsigned module is not likely to lead to a kernel crash (or at least, not because it is unsigned), so using the TAINT_FORCED_MODULE flag in that case is not particularly fair. The tracepoint code discriminates against force-loaded modules because enabling tracepoints in mismatched modules could easily lead to a system crash. The tracepoint code does allow TAINT_CRAP (modules built from the staging tree) and TAINT_OOT_MODULE (out-of-tree modules) specifically, but tracepoints in the modules get silently disabled if there is any other taint flag.
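In simplified form, that check looks roughly like the following (a sketch of the logic, not the verbatim tracepoint code):

    /* Sketch of the test in the tracepoint code's module notifier: staging
     * (TAINT_CRAP) and out-of-tree (TAINT_OOT_MODULE) modules are tolerated;
     * any other taint flag silently disables the module's tracepoints. */
    if (mod->taints & ~((1 << TAINT_OOT_MODULE) | (1 << TAINT_CRAP)))
            return 0;       /* skip tracepoint registration for this module */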
Mathieu Desnoyers posted an RFC patch to change that situation. It added a new TAINT_UNSIGNED_MODULE flag that got set when those modules were loaded. It also changed the test in the tracepoint code to allow tracing for the new taint type. It drew an immediate NAK from Ingo Molnar, who did not find Desnoyers's use case at all compelling: "External modules should strive to get out of the 'crap' and 'felony law breaker' categories and we should not make it easier for them to linger in a broken state."
But the situation is not as simple as Molnar seems to think. There are distribution kernels that turn on signature checking, but allow users to decide whether to require signatures by using module.sig_enforce. Since it is the distribution's key that is stored in the kernel image, strict enforcement would mean that only modules built by the distribution could be loaded. That leaves out a wide variety of modules that a user might want to load: modules under development, back-ported modules from later kernels, existing modules being debugged, and so on.
Module maintainer Rusty Russell was fairly unimpressed with the arguments given in favor of the change, at least at first, noting that the kernel configuration made it clear that users needed to arrange to sign their own modules: "Then you didn't do that. You broke it, you get to keep both pieces." That's not exactly the case, though, since CONFIG_MODULE_SIG_FORCE was not set (nor was module.sig_enforce passed on the kernel command line), so users aren't actually required to arrange for signing. But Russell was looking for "an actual valid use case".
The problem essentially boils down to the fact that the kernel is lying when it uses TAINT_FORCED_MODULE for a module that, in fact, hasn't been forced. Steven Rostedt tried to make that clear: "Why the hell are we setting a FORCED_MODULE flag when no module was forced????" He also noted that he is often the one to get bug reports from folks whose tracepoints aren't showing up because they didn't sign their module. As he pointed out, it is a silent failure and the linkage between signed modules and tracepoints is not particularly obvious.
Johannes Berg was eventually able to supply the kind of use case Russell was looking for, though. In his message, he summarized the case for unsigned modules nicely:
Berg also provided another reason for loading unsigned modules: backported kernel modules from the kernel.org wiki to support hardware (presumably, in his case, wireless network hardware) features not present in the distribution-supplied drivers. He was quite unhappy to hear those kinds of drivers, which "typically only diverge from upstream by a few patches", characterized as crap or law-breaking.
Berg's use case was enough for Russell to agree to the change and to add it to his pending tree. We should see Desnoyers's final patch, which has some cosmetic changes from the RFC, in 3.15. At that point, the kernel will be able to distinguish between these two different kinds of taint and users will be able to trace modules they have loaded, signed or unsigned.
Optimizing VMA caching
The kernel divides each process's address space into virtual memory areas (VMAs), each of which describes where the associated range of addresses has its backing store, its protections, and more. A mapping created by mmap(), for example, will be represented by a single VMA, while mapping an executable file into memory may require several VMAs; the list of VMAs for any process can be seen by looking at /proc/PID/maps. Finding the VMA associated with a specific virtual address is a common operation in the memory management subsystem; it must be done for every page fault, for example. It is thus not surprising that this mapping is highly optimized; what may be surprising is the fact that it can be optimized further.
The VMAs for each address space are stored in a red-black tree, which enables a specific VMA to be looked up in logarithmic time. These trees scale well, which is important; some processes can have hundreds of VMAs (or more) to sort through. But it still takes time to walk down to a leaf in a red-black tree; it would be nice to avoid that work at least occasionally if it were possible. Current kernels work toward that goal by caching the results of the last VMA lookup in each address space. For workloads with any sort of locality, this simple cache can be quite effective, with hit rates of 50% or more.
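That existing cache is about as simple as such things get; in current kernels, find_vma() does something along the following lines (a simplified sketch, not the mainline code verbatim; rb_tree_lookup() here stands in for the actual red-black tree walk):

    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            /* One cached entry per address space: check it before the tree. */
            vma = ACCESS_ONCE(mm->mmap_cache);
            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                    return vma;

            vma = rb_tree_lookup(mm, addr); /* placeholder for the tree walk */
            if (vma)
                    mm->mmap_cache = vma;   /* remember the result for next time */
            return vma;
    }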
But Davidlohr Bueso thought it should be possible to do better. Last November, he posted a patch adding a second cache holding a pointer to the largest VMA in each address space. The logic was that the VMA with the most addresses would see the most lookups, and his results seemed to bear that out; with the largest-VMA cache in place, hit rates went to over 60% for some workloads. It was a good improvement, but the patch did not make it into the mainline. Looking at the discussion, one can quickly come up with a useful tip for aspiring kernel developers: if Linus responds by saying "This patch makes me angry", the chances of it being merged are relatively small.
Linus's complaint was that caching the largest VMA seemed "way too ad-hoc" and wouldn't be suitable for a lot of workloads. He suggested caching a small number of recently used VMAs instead. Additionally, he noted that maintaining a single cache per address space, as current kernels do, might not be a good idea. In situations where multiple threads are running in the same address space, it is likely that each thread will be working with a different set of VMAs. So making the cache per-thread, he said, might yield much better results.
A few iterations later, Davidlohr has posted a VMA-caching patch set that appears to be about ready to go upstream.
Following Linus's suggestion, the single-VMA cache (mmap_cache in struct mm_struct) has been replaced by a small array called vmacache in struct task_struct, making it per-thread. On systems with a memory management unit (almost all systems), that array holds four entries. There are also new sequence numbers stored in both struct mm_struct (one per address space) and in struct task_struct (one per thread).
The purpose of the sequence numbers is to ensure that the cache does not return stale results. Any change to the address space (the addition or removal of a VMA, for example) causes the per-address-space sequence number to be incremented. Every attempt to look up an address in the per-thread cache first checks the sequence numbers; if they do not match, the cache is deemed to be invalid and will be reset. Address-space changes are relatively rare in most workloads, so the invalidation of the cache should not happen too often.
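In rough terms, the mechanism looks like this (an illustrative sketch; the field and function names follow the patch set, but the declarations are simplified and this is not the patch itself):

    #define VMACACHE_SIZE 4                 /* per-thread entries on MMU systems */

    struct mm_struct {
        /* ... */
        u32 vmacache_seqnum;                /* per-address-space count */
    };

    struct task_struct {
        /* ... */
        u32 vmacache_seqnum;                /* mm's count at this thread's last sync */
        struct vm_area_struct *vmacache[VMACACHE_SIZE];
    };

    /* Called whenever a VMA is added, removed, or resized. */
    static inline void vmacache_invalidate(struct mm_struct *mm)
    {
        mm->vmacache_seqnum++;
    }

    /* Checked at the start of every per-thread cache lookup. */
    static bool vmacache_valid(struct mm_struct *mm)
    {
        if (current->vmacache_seqnum != mm->vmacache_seqnum) {
            /* The address space changed since this thread last looked:
             * resynchronize and throw away the cached entries. */
            current->vmacache_seqnum = mm->vmacache_seqnum;
            memset(current->vmacache, 0, sizeof(current->vmacache));
            return false;
        }
        return true;
    }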
Every call to find_vma() (the function that locates the VMA for a virtual address) first does a linear search through the cache to see if the needed VMA is there. Should the VMA be found, the work is done; otherwise, a traversal of the red-black tree will be required. In this case, the result of the lookup will be stored back into the cache. That is done by overwriting the entry indexed by the lowest bits of the page-frame number associated with the original virtual address. It is, thus, a random replacement policy for all practical purposes. The caching mechanism is meant to be fast so there would probably be no benefit from trying to implement a more elaborate replacement policy.
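The lookup and replacement logic, again in sketch form and under the same naming assumptions as above, might read:

    /* Illustrative sketch of the per-thread lookup and its replacement
     * policy: the victim slot comes from the low bits of the page-frame
     * number, which is effectively random replacement. */
    #define VMACACHE_HASH(addr) (((addr) >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))

    static struct vm_area_struct *vmacache_find(struct mm_struct *mm,
                                                unsigned long addr)
    {
        int i;

        if (!vmacache_valid(mm))
            return NULL;                        /* cache was just reset */

        for (i = 0; i < VMACACHE_SIZE; i++) {   /* short linear scan */
            struct vm_area_struct *vma = current->vmacache[i];

            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                return vma;
        }
        return NULL;                            /* caller walks the rbtree */
    }

    /* Called after a successful tree lookup to remember the result. */
    static void vmacache_update(unsigned long addr, struct vm_area_struct *vma)
    {
        current->vmacache[VMACACHE_HASH(addr)] = vma;
    }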
How well does the new scheme work? It depends on the workload, of course. For system boot, where almost everything running is single-threaded, Davidlohr reports that the cache hit rate went from 51% to 73%. Kernel builds, unsurprisingly, already work quite well with the current scheme with a hit rate of 75%, but, even in this case, improvement is possible: that rate goes to 88% with Davidlohr's patch applied. The real benefit, though, can be seen with benchmarks like ebizzy, which is designed to simulate a multithreaded web server workload. Current kernels find a cached VMA in a mere 1% of lookup attempts; patched kernels, instead, show a 99.97% hit rate.
With numbers like that, it is hard to find arguments for keeping this patch out of the mainline. At this point, the stream of suggestions and comments has come to a halt. Barring surprises, a new VMA lookup caching mechanism seems likely to find its way into the 3.15 kernel.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet
