Kernel development
Brief items
Kernel release status
The current development kernel is 3.16-rc2, released on June 21. Linus said: "It's a day early, but tomorrow ends up being inconvenient for me due to being on the road most of the day, so here you are. These days most people send me their pull requests and patches during the week, so it's not like I expect that a Sunday release would have made much of a difference. And it's also not like I didn't have enough changes for making a rc2 release."
Stable updates: none have been released in the last week. The 3.15.2, 3.14.9, 3.10.45, and 3.4.95 updates are in the review process as of this writing; they can be expected on or after June 26.
Quotes of the week
Kernel development news
RCU, cond_resched(), and performance regressions
Performance regressions are a constant problem for kernel developers. A seemingly innocent change might cause a significant performance degradation, but only for users and workloads that the original developer has no access to. Sometimes these regressions can lurk for years until the affected users update their kernels and notice that things are running more slowly. The good news is that the development community is responding with more testing aimed at detecting performance regressions. This testing found a classic example of this kind of bug in 3.16; the bug merits a look as an example of how hard it can be to keep things working optimally for a wide range of users.
The birth of a regression
The kernel's read-copy-update (RCU) mechanism enables a great deal of kernel scalability by facilitating lock-free changes to data structures and batching of cleanup operations. A fundamental aspect of RCU's operation is the detection of "quiescent states" on each processor; a quiescent state is one in which no kernel code can hold a reference to any RCU-protected data structure. Initially, quiescent states were defined as times when the processor was running in user space, but things have gotten rather more complex since then. (See LWN's lengthy list of RCU articles for lots of details on how this all works).
The kernel's full tickless mode, which is only now becoming ready for serious use, can make the detection of quiescent states more difficult. A CPU running in the tickless mode will, due to the constraints of that mode, be running a single process. If that process stays within the kernel for a long time, no quiescent states will be observed. That, in turn, prevents RCU from declaring the end of a "grace period" and running the (possibly lengthy) set of accumulated RCU callbacks. Delayed grace periods can result in excessive latencies elsewhere in the kernel or, if things go really badly, out-of-memory problems.
One might argue (as some developers did) that code that loops in the kernel in this way already has serious problems. But such situations do come about. Eric Dumazet mentioned one: a process calling exit() when it has thousands of sockets open. Each of those open sockets will result in structures being freed via RCU; that can lead to a long list of work to be done while that same process is still closing sockets and, thus, preventing RCU processing by looping in the kernel.
RCU developer Paul McKenney put together a solution to this problem based on a simple insight: the kernel already has a mechanism for allowing other things to happen while some sort of lengthy operation is in progress. Code that is known to be prone to long loops will, on occasion, call cond_resched() to give the scheduler a chance to run a higher-priority process. In the tickless situation, there will be no higher-priority process, though, so, in current kernels, cond_resched() does nothing of any use in the tickless mode.
But kernel code can only call cond_resched() in places where it can handle being scheduled out of the CPU. So it cannot be running in an atomic context and, thus, cannot hold references to any RCU-protected data structures. In other words, a call to cond_resched() marks a quiescent state; all that is needed is to tell RCU about it.
As it happens, cond_resched() is called in a lot of performance-sensitive places, so it is not possible to add a lot of overhead there. So Paul did not call into RCU to signal a quiescent state with every cond_resched() call; instead, that function was modified to increment a per-CPU counter and, using that counter, only call into RCU once for every 256 (by default) cond_resched() calls. That appeared to fix the problem with minimal overhead, so the patch was merged during the 3.16 merge window.
Soon thereafter, Dave Hansen reported that one of his benchmarks (a program which opens and closes a lot of files while doing little else) had slowed down, and that, with bisection, he had identified the cond_resched() change as the culprit. Interestingly, the problem is not with cond_resched() itself, which remained fast as intended. Instead, the change caused RCU grace periods to happen more often than before; that caused RCU callbacks to be processed in smaller batches and led to increased contention in the slab memory allocator. By changing the threshold for quiescent states from every 256 cond_resched() calls to a much larger number, Dave was able to get back to a 3.15 level of performance.
Fixing the problem
One might argue that the proper fix is simply to raise that threshold for all users. But doing so doesn't just restore performance; it also restores the problem that the cond_resched() change was intended to fix. The challenge, then, is finding a way to fix one workload's problem without penalizing other workloads.
There is an additional challenge in that some developers would like to make cond_resched() into a complete no-op on fully preemptable kernels. After all, if the kernel is preemptable, there should be no need to poll for conditions that would require calling into the scheduler; preemption will simply take care of that when the need arises. So fixes that depend on cond_resched() continuing to do something may fail on preemptable kernels in the future.
Paul's first fix took the form of a series of patches making changes in a few places. There was still a check in cond_resched(), but that check took a different form. The RCU core was modified to take note when a specific processor holds up the conclusion of a grace period for an excessive period of time; when that condition was detected, a per-CPU flag would be set. Then, cond_resched() need only check that flag and, if it is set, note the passing of a quiescent state. That change reduced the frequency of grace periods, restoring much of the lost performance.
In addition, Paul introduced a new function called cond_resched_rcu_qs(), otherwise known as "the slow version of cond_resched()". By default, it does the same thing as ordinary cond_resched(), but the intent is that it would continue to perform the RCU grace period check even if cond_resched() is changed to skip that check — or to do nothing at all. The patch changed cond_resched() calls to cond_resched_rcu_qs() in a handful of strategic places where problems have been observed in the past.
This solution worked, but it left some developers unhappy. For those who are trying to get the most performance out of their CPUs, any overhead in a function like cond_resched() is too much. So Paul came up with a different approach that requires no checks in cond_resched() at all. Instead, when the RCU core notices that a CPU has held up the grace period for too long, it sends an inter-processor interrupt (IPI) to that processor. That IPI will be delivered when the target processor is not running in atomic context; it is, thus, another good time to note a quiescent state.
This solution might be surprising at first glance: IPIs are expensive and, thus, are not normally seen as the way to improve scalability. But this approach has two advantages: it removes the monitoring overhead from the performance-sensitive CPUs, and the IPIs only happen when a problem has been detected. So, most of the time, it should have no impact on CPUs running in the tickless mode at all. It would thus appear that this solution is preferable, and that this particular performance regression has been solved.
How good is good enough?
At least, it would appear that way if it weren't for the fact that Dave still observes a slowdown, though it is much smaller than it was before. The solution is, thus, not perfect, but Paul is inclined to declare victory on this one anyway.

Dave still isn't entirely happy with the situation; he noted that the regression is closer to 10% with the default settings, and said "This change of existing behavior removes some of the benefits that my system gets out of RCU". Paul responded that he is "not at all interested in that micro-benchmark becoming the kernel's straightjacket" and sent in a pull request including the second version of the fix. If there are any real-world workloads that are adversely affected by this change, he suggested, there are a number of ways to tune the system to mitigate the problem.
Regardless of whether this issue is truly closed or not, this regression demonstrates some of the hazards of kernel development on contemporary systems. Scalability pressures lead to complex code trying to ensure that everything happens at the right time with minimal overhead. But it will never be possible for a developer to test with all possible workloads, so there will often be one that shows a surprising performance regression in response to a change. Fixing one workload may well penalize another; making changes that do not hurt any workloads may be close to impossible. But, given enough testing and attention to the problems revealed by the tests, most problems can hopefully be found and corrected before they affect production users.
Reworking kexec for signatures
The kernel execution (kexec) subsystem allows a running kernel to switch to a different kernel. This allows for faster booting, as the system firmware and bootloader are bypassed, but it can also be used to produce crash dumps using Kdump. However, as Matthew Garrett explained on his blog, kexec could be used to circumvent UEFI secure boot restrictions, which led him to propose a way to disable kexec on secure boot systems. That was not terribly popular, but a more recent patch set would provide a path for kexec to only boot signed kernels, which would solve the problem Garrett was trying to address without completely disabling the facility.
The kexec subsystem consists of the kexec_load() system call that loads a new kernel into memory, which can then be booted using the reboot() system call. There is also a kexec command that will both load the new kernel and boot it, without entering the system firmware (e.g. BIOS or UEFI) and bootloader.
But the UEFI firmware is what enforces the secure boot restrictions. Garrett was concerned that a Linux kernel could be used to boot an unsigned (and malicious) Windows operating system by way of kexec because it circumvents secure boot. That might lead Microsoft to blacklist the keys used to sign Linux bootloaders, which would make it difficult to boot Linux on commodity hardware. Using kexec that way could affect secure-booted Linux systems too, of course, though Microsoft might not be so quick to revoke keys under those circumstances.
In any case, Garrett eventually removed the kexec-disabling portion of his patch set (though he strongly suggested that distributions should still disable kexec if they are going to support secure boot). Those patches have not been merged (yet?). More recently, Vivek Goyal has put together a patch set that is intended to address Garrett's secure boot concerns, but would also protect systems that only allow loading signed kernel modules. As Garrett showed in his blog post, that restriction can be trivially bypassed by executing a new kernel that simply alters the sig_enforce sysfs parameter in the original kernel's memory and then jumps back to that original kernel.
Goyal's patches start down the path toward being able to restrict kexec so that it will only load signed code. To that end, this patch set defines a new system call:
long kexec_file_load(int kernel_fd, int initrd_fd,
const char *cmdline_ptr, unsigned long cmdline_len,
unsigned long flags);
It will load the kernel executable from the kernel_fd file descriptor and will associate the "initial ramdisk" (initrd) from the initrd_fd descriptor. It will also associate the kernel command line passed as cmdline_ptr and cmdline_len. The initrd and command-line information will be used when the kernel is actually booted. This contrasts with the existing kexec system call:
long kexec_load(unsigned long entry, unsigned long nr_segments,
struct kexec_segment *segments, unsigned long flags);
It expects to get segments that have been parsed out of a kernel binary in user space and to just blindly load them into memory. As can be seen, kexec_file_load() puts the kernel in the loop so that it can (eventually) verify what is being loaded and executed.
As one of the segments that get loaded, there is a standalone executable object, called "purgatory", that runs between the two kernels. At reboot() time, the "exiting" kernel jumps to the purgatory code. Its main function is to check the SHA-256 hashes of the other segments that were loaded. If those have not been corrupted, booting can proceed. The purgatory code will copy some memory to a backup region and do some architecture-specific setup, then jump to the new kernel.
The purgatory code currently lives in kexec-tools, but if the kernel is to take responsibility for setting up the segments from the kernel binary and initrd, it will need a purgatory of its own. Goyal's patch set adds that code for x86 to arch/x86/purgatory/.
Goyal also copied code from crypto/sha256_generic.c into the purgatory directory. It's clear he would rather simply just use the code directly from the crypto/ directory, but could not find a way to do so:
So instead of doing #include on sha256_generic.c I just copied relevant portions of code into arch/x86/purgatory/sha256.c. Now we shouldn't have to touch this code at all. Do let me know if there are better ways to handle it.
While the patch set is at version 3 (earlier versions: v2, v1), it is still a "request for comment" (RFC) patch. There are various unfinished pieces, with signature verification topping the list. So far, the new facility is only available for the x86_64 architecture and bzImage kernel images. Adding other architectures and support for the ELF kernel format still remain to be done. There is also a need for some documentation, including a man page.
Goyal did explain his vision for how the signature verification will work. It is based on David Howells's work on verifying the signatures for loadable kernel modules. Essentially, the signature will be verified when kexec_file_load() is called. That is also when the SHA-256 hashes for each segment are calculated and stored in the purgatory segment. So, all purgatory has to do is verify the hashes (which it already does to avoid running corrupted code) to ensure that only a properly signed kernel will be executed.
There have been plenty of comments on each version of the patch set, but most of those on v3 were technical suggestions for improving the code. So far, there have been no complaints about the overall idea, which means we may well see the ability to require cryptographic signatures on the kernels passed to kexec added as a feature sometime in the next year—hopefully sooner than that. It would be a nice feature to have when Garrett's secure boot patches get merged.
Questioning EXPORT_SYMBOL_GPL()
There have been arguments about the legality of binary-only kernel modules for almost as long as the kernel has had loadable module support. One of the key factors in this disagreement is the EXPORT_SYMBOL_GPL() directive, which is intended to keep certain kernel functions out of the reach of proprietary modules. A recent discussion about the merging of a proposed new kernel subsystem has revived some questions about the meaning and value of EXPORT_SYMBOL_GPL() — and whether it is worth bothering with at all.

Loadable modules do not have access to every function or variable in the kernel; instead, they can only make use of symbols that have been explicitly "exported" to them by way of the EXPORT_SYMBOL() macro or one of its variants. When plain EXPORT_SYMBOL() is used, any kernel module is able to gain access to the named symbol. If the developer uses EXPORT_SYMBOL_GPL() instead, the symbol will only be made available to modules that have declared that they are distributable under a GPL-compatible license. EXPORT_SYMBOL_GPL() is meant to mark kernel interfaces that are deemed to be so low-level and specific to the kernel that any software that uses them must perforce be a derived product of the kernel. The GPL requires that derived products, if distributed, be made available under the same license; EXPORT_SYMBOL_GPL() is thus a statement that the named symbol should only be used by GPL-compatible code.
It is worth noting that nobody has said that symbols exported with plain EXPORT_SYMBOL() can be freely used by proprietary code; indeed, a number of developers claim that all (or nearly all) loadable modules are derived products of the kernel regardless of whether they use GPL-only symbols or not. In general, the kernel community has long worked to maintain a vague and scary ambiguity around the legal status of proprietary modules while being unwilling to attempt to ban such modules outright.
Shared DMA buffers
Recent years have seen a fair amount of development intended to allow device drivers to share DMA buffers with each other and with user space. A common use case for this capability is transferring video data directly from a camera to a graphics controller, allowing that data to be displayed with no need for user-space involvement. The DMA buffer-sharing subsystem, often just called "dma-buf," is a key part of this functionality. When the dma-buf code was merged in 2012, there was a lengthy discussion on whether that subsystem should be exported to modules in the GPL-only mode or not.
The code as originally written used EXPORT_SYMBOL_GPL(). A representative from NVIDIA requested that those exports be changed to EXPORT_SYMBOL() instead; if dma-buf were to be GPL-only, he said, the result would not be to get NVIDIA to open-source its driver.
At the time, a number of the developers involved evidently discussed the question at the Embedded Linux Conference and concluded that EXPORT_SYMBOL() was appropriate in this case. Other developers, however, made it clear that they objected to the change. No resolution was ever posted publicly, but the end result is clear: the dma-buf symbols are still exported GPL-only in current kernels.
On the fence
More recently, a major enhancement to dma-buf functionality has come along in the form of the fence synchronization subsystem. A "fence" is a primitive that indicates whether an operation on a dma-buf has completed or not. For the camera device described above, for example, the camera driver could use a fence to signal when the buffer actually contains a new video frame. The graphics driver would then wait for the fence to signal completion before rendering the buffer to the display; it, in turn, could use a fence to signal when the rendering is complete and the buffer can be reused. Fences thus sound something like the completion API, but there is additional complexity there to allow for hardware signaling, cancellation, fences depending on other fences, and more. All told, the fence patches add some 2400 lines of code to the kernel.
The fence subsystem is meant to replace Android-specific code (called "Sync") with similar functionality. Whether that will happen remains to be seen; it seems that the Android developers have not said whether they will be able to use it, and, apparently, not all of the needed functionality is there. But there is another potential roadblock here: GPL-only exports.
The current fence code does not export its symbols with EXPORT_SYMBOL_GPL(); it mirrors the Sync driver (which is in the mainline staging area) in that regard. While he was reviewing the code, driver core maintainer Greg Kroah-Hartman requested that the exports be changed to GPL-only, saying that GPL-only is how the rest of the driver core has been done. That request was not well received by Rob Clark, whose response referred to NVIDIA's "syncpt" mechanism, an NVIDIA-specific equivalent to a fence.
Greg proved to be persistent in his request, though, claiming that GPL-only exports have made the difference in bringing companies around in the past. Graphics maintainer Dave Airlie, who came down hard on proprietary graphics modules a few years ago, disagreed here, saying that the only thing that has really made the difference has been companies putting pressure on each other. Little else, he said, has been effective despite claims that some in the community might like to make. His vote was for "author's choice" in this case.
Is EXPORT_SYMBOL_GPL() broken?
Dave went on to talk about the GPL-only export situation in general; his concluding point may be the most relevant in the end. For years, the kernel community has muttered threateningly about proprietary kernel modules without taking much action to change the situation. So manufacturers continue to ship such modules without much fear of any sort of reprisal. Clearly the community tolerates these modules, regardless of its (often loud) statements about the possible legal dangers that come with distributing them.
Even circumvention of EXPORT_SYMBOL_GPL() limitations seems to be tolerated in the end; developers will complain publicly (sometimes) when it happens, but no further action ensues. So it should not be surprising if companies are figuring out that they need not worry too much about their binary-only modules.
So it is not clear that EXPORT_SYMBOL_GPL() actually helps much at this point. It has no teeth to back it up. Instead, it could be seen as a sort of speed bump that makes life a bit more inconvenient for companies shipping binary-only modules. A GPL-only export lets developers express their feelings, and it may slow things down a bit, but, in many cases at least, these exports do not appear to be changing behavior much. The fence patches, in particular, are aimed at embedded devices, where proprietary graphics drivers are, unfortunately, still the norm. Making the interface be GPL-only is probably not going to turn that situation around.
Perhaps one could argue that EXPORT_SYMBOL_GPL() is a classic example of an attempt at a technical solution to a social problem. If proprietary modules are truly a violation of the rights of kernel developers, then, sooner or later, some of those developers are going to need to take a stand to enforce those rights. The alternative is a world where binary-only kernel drivers are distributed with tacit approval from the kernel community, regardless of how many symbols are marked as being EXPORT_SYMBOL_GPL().
As with the dma-buf case, no resolution to the question of how symbols should be exported from the fence subsystem has been posted. But Greg has said that he will not give up on this particular issue, and, as the maintainer who would normally accept a patch set in this area, he is in a fairly strong position to back up his views. We may have to wait until this code is actually merged to see which position will ultimately prevail. But it seems that, increasingly, some developers will wonder if it even matters.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Jonathan Corbet