Kernel development
Brief items
Kernel release status
The current development kernel is 4.0-rc1, released on February 22. As can be seen, Linus decided in the end to call this release "4.0". "But nobody should notice. Because moving to 4.0 does *not* mean that we somehow changed what people see. It's all just more of the same, just with smaller numbers so that I can do releases without having to take off my socks again." The codename has also changed to "Hurr durr I'ma sheep."
Stable updates: none have been released in the last week. The 3.18.8, 3.14.34, and 3.10.70 updates are in the review process as of this writing; they can be expected on or after February 27.
Quotes of the week
Lazytime hits a snag
The "lazytime" concept was first posted by Ted Ts'o in November 2014. It attempts to address the performance costs of tracking the access time of each file while maintaining a more accurate notion of the last access time than the "relatime" option provides. In short, the last-access time is always kept current for as long as a file's inode is in memory; it is only written to persistent store if (1) there is another reason to write out the inode, or (2) the inode is being evicted from the cache. After a rewrite (to make it work at the virtual filesystem layer rather than being an ext4-specific option), lazytime was merged for the upcoming 4.0 kernel.

That does not necessarily mean that 4.0 users will be able to enable this option, though. Jan Kara has identified some problems with the implementation that can cause incorrect times to be recorded in some situations. The issues look serious enough that use of lazytime in its current form is probably not a good idea. Ted is looking into the report, noting that the option can be disabled before the 4.0 release if the problems are not easily fixed.
According to Jan, chances are that an easy fix will not be within reach. So it may well be that, while the 4.0 kernel will have the lazytime code, users will not yet have access to it. Having kernel features work as intended before exposing them to users is one thing developers cannot be lazy about.
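In rough terms, the write-back decision described above can be modeled like this (an illustrative Python sketch, not the actual kernel code; the function name and parameters are invented for clarity):

```python
# Toy model of the lazytime write-back policy; illustrative only.
def must_write_inode(dirty, atime_dirty, evicting):
    """Decide whether the on-disk inode should be updated now."""
    if dirty:
        # Another change already forces a write, so the updated
        # access time rides along for free.
        return True
    if evicting and atime_dirty:
        # The inode is leaving the cache; persist the access time
        # so it is not lost.
        return True
    # A pure access-time update stays in memory for now.
    return False

# A file that has only been read: no write until eviction.
print(must_write_inode(dirty=False, atime_dirty=True, evicting=False))  # False
print(must_write_inode(dirty=False, atime_dirty=True, evicting=True))   # True
```

The point of the policy is visible in the model: pure access-time updates generate no I/O of their own, yet the time that eventually reaches the disk is exact rather than relatime's approximation.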
Kernel development news
The end of the 4.0 merge window
By the time Linus released 4.0-rc1 on February 22, 8,950 non-merge changesets had been pulled into the mainline repository for this development cycle. The changes pulled are the usual mix for the end of the merge window, with fixes starting to dominate over new features. Still, a few new things were to be found in the 1,100 changes pulled since last week's summary, including:
- The overlayfs union filesystem can now support multiple read-only layers.
- The virtio subsystem has been updated for compliance with the recently adopted virtio 1.0 standard.
- The Btrfs filesystem has received a set of out-of-space-handling fixes resulting from its use at Facebook. The pull request suggests there will be more of these coming in the future.
- The dm-crypt device mapper target has seen a number of scalability changes that improve its performance on larger systems.
- New hardware support includes:
  - Systems and processors: Intel Quark X1000 SoC boards and MIPS processors running MIPS32 Release 6.
  - Clock: TI CDCE706 clock synthesizers and Qualcomm IPQ806x and APQ8064/MSM8960 LPASS clock controllers.
  - Miscellaneous: Hisilicon NAND flash controllers, Renesas R-Car Gen2 DMA controllers, IMG multi-threaded DMA controllers, Allwinner SoC pulse-width modulator (PWM) controllers, Imagination Technologies PWM controllers, Intel Baytrail I2C semaphores, and Broadcom iProc I2C controllers.
  - Power management: Richtek RT5033 power management ICs, Dialog Semiconductor DA9150 charger fuel-gauge chips, and Qualcomm resource power managers.
  - Watchdog: Imagination Technologies PDC watchdog timers and Mediatek SoC integrated watchdogs.
The indications at the beginning of the merge window were that this would be a relatively small development cycle. In fact, as can be seen in the table below, one has to go back to 3.6 (released in September 2012) to find a merge window with fewer patches:
Patches pulled during the merge window

Release   Patches
4.0         8,950
3.19       11,408
3.18        9,711
3.17       10,872
3.16       11,364
3.15       12,034
3.14       10,622
3.13       10,518
3.12        9,479
3.11        9,494
3.10       11,963
3.9        10,265
3.8        10,901
3.7        10,409
3.6         8,587
3.5         9,534
3.4         9,248
3.3         8,899
3.2        10,214
3.1         7,202
3.0         7,333
From the table, one can see that there is a natural ebb and flow to the kernel development process; sometimes there is simply more going on than at other times. The overall trend remains in the upward direction, though, with the number of changes going into the kernel growing over the long (or even medium) term.
As was expected, Linus has bumped the major version number of this release to "4". There is little significance to this change beyond the fact that the minor numbers were getting large.
This development cycle has now moved into the stabilization phase where the remaining bugs are (hopefully) found and fixed. The last three development cycles have been exactly nine weeks long; if that pattern holds this time around as well, the 4.0 kernel will be released on April 12.
A rough patch for live patching
One of the headline features in the upcoming 4.0 kernel is live patching — the ability to apply a patch to a running kernel and fix a problem without disrupting the operation of the system. The truth of the matter, though, is that the live-patching support merged for 4.0 is only the beginning of the story; quite a bit more work will have to be done to have full support for this feature in the kernel. And now it seems that this work may take a bit longer than the developers involved had hoped; indeed, one prominent developer is calling for the entire concept to be rethought.

The code merged for 4.0 is a common core that is able to support patching with both kpatch and kGraft. It provides an API that allows patch-containing modules to be inserted into the kernel; it also allows the listing and removal of patches if need be. This API performs the low-level redirection needed to replace patched functions. That is good as far as it goes, but it is missing an important component, called the "consistency model," that ensures the safety of switching between versions of a function in a running kernel. If the change is simple, it may be possible to safely make the change at any time. More complicated changes, though, may require that no kernel code is running in any of the affected functions before the switch can be done. The consistency model as found in kpatch and kGraft is where some of the biggest differences between those two implementations lie, so some work will clearly be needed to bring them together.
As originally developed, kpatch worked by calling stop_machine() to bring the entire system to a halt. It would then check the stack of every process in the system to ensure that none were running within the function(s) to be patched; if the affected functions were not running anywhere, the patch could proceed, otherwise the operation failed. KGraft, instead, used a "two-universe" model where every process in the system is switched from the old code to the new at a "safe" point. The most common safe point is exit from a system call; at that point, the process cannot be running in any kernel code.
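The kpatch-style stack check amounts to something like the following (an illustrative Python model; the real code walks actual kernel stack traces, and the function names here are invented):

```python
# Illustrative model of the kpatch stack check: with the system stopped,
# patching is safe only if no task is currently executing inside any
# function being replaced.
def safe_to_patch(task_stacks, patched_funcs):
    patched = set(patched_funcs)
    # Every task's stack must be free of the patched functions.
    return all(patched.isdisjoint(stack) for stack in task_stacks)

stacks = [
    ["ret_from_syscall", "sys_read", "vfs_read"],   # task 1, mid-read
    ["ret_from_syscall", "sys_nanosleep"],          # task 2, sleeping
]
print(safe_to_patch(stacks, ["vfs_read"]))   # False: task 1 is inside vfs_read
print(safe_to_patch(stacks, ["vfs_write"]))  # True: nobody is running it
```

When the check fails, kpatch simply aborts the attempt; the administrator can retry later in the hope that the affected functions are no longer on any stack.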
A unified consistency model
Both approaches have their advantages and disadvantages; an attempt to unite them would, hopefully, take the best from each. And that is what Josh Poimboeuf tried to do with his consistency model patch set posted in early February. This approach retains the two-universe model from kGraft, but it uses the stack-trace checking from kpatch to accelerate the task of switching processes to the new code. In theory, this technique increases the chances of successfully applying patches while doing away with kpatch's disruptive stop_machine() call and much of kGraft's higher code complexity.
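Under the combined model, each task can be flipped to the new universe individually as soon as a stack check shows it to be safe, roughly like this (an invented-name Python sketch, not Josh Poimboeuf's actual code):

```python
# Sketch of the unified model: the per-task two-universe switch from
# kGraft, accelerated by a kpatch-style stack check.
def migration_pass(tasks, patched_funcs):
    """One pass over all tasks; returns True once every task has
    been switched to the new code."""
    patched = set(patched_funcs)
    for task in tasks:
        if task["universe"] == "old" and patched.isdisjoint(task["stack"]):
            task["universe"] = "new"   # safe: not running patched code
    return all(t["universe"] == "new" for t in tasks)

tasks = [
    {"stack": ["sys_read", "vfs_read"], "universe": "old"},
    {"stack": ["sys_nanosleep"],        "universe": "old"},
]
done = migration_pass(tasks, ["vfs_read"])
print(done)  # False: the first task must reach a safe point first
```

The appeal of the design is visible in the sketch: most tasks can be migrated immediately, without a stop_machine() pause, while the few tasks actually running patched code are picked up on later passes once they reach a safe point.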
The first objections to be raised focused on one particular aspect of the consistency code: the stack check, which Peter Zijlstra argued against relying on.

Ingo Molnar also came out against the use of stack traces. It comes down to the fact that getting a reliable stack trace out of a process running in kernel space is not as easy as one might expect. There have been lots of bugs in that code in the past, and each architecture brings its own set of special glitches to deal with.

As Ingo pointed out, that means a bug in the traceback code is quite likely to stay out of sight until some distributor issues a live patch, at which point things will go badly wrong. Things going badly wrong and disrupting a running system is exactly what users calling for live patching most want to avoid, so one can imagine that widespread unhappiness would ensue. But it is a risk that will always be hard to avoid, since the correct functioning of the kernel does not otherwise depend on perfectly accurate stack traces.
There are a number of approaches to consistency, and not all of them use stack traces. Given the opposition to that idea, it seems likely that future proposals will omit that technique. But that leaves open the question of what will be used. Ingo is pushing strongly for an approach that forces every process in the system into a quiescent, non-kernel state before applying a patch. It is arguably the simplest approach; it also puts the kernel in a state where it is easy to know that applying the patch is a safe thing to do.
But, as it turns out, the "simplest" approach still has a fair number of tricky details. Kernel threads cannot be pushed out of kernel space, so some other solution must be found for them. Processes that are blocked in the kernel for some sort of long-term wait need to be unblocked, preferably in a way that can be restarted transparently once the patching process is complete. That could require changes to the implementation of a lot of system calls — and, perhaps, a lot of drivers as well. Some ideas for simplifying this task have circulated, but it would take a while to get an implementation to the point where it would reliably succeed in patching a running kernel.
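The tricky cases can be seen in a toy model like the following (illustrative Python with invented task states; the real work would live in the kernel's scheduler and signal-handling code):

```python
# Toy model of the "force everything quiescent" approach: find the
# tasks that cannot simply be pushed out to user space.
def quiesce_blockers(tasks):
    blockers = []
    for t in tasks:
        if t["kernel_thread"]:
            # Kernel threads never run in user space at all.
            blockers.append(t["name"])
        elif t["state"] == "blocked" and not t["restartable"]:
            # A long-term wait that cannot be transparently restarted.
            blockers.append(t["name"])
    return blockers   # an empty list would mean patching can proceed

tasks = [
    {"name": "kworker/0:1", "kernel_thread": True,
     "state": "running", "restartable": False},
    {"name": "logger", "kernel_thread": False,
     "state": "blocked", "restartable": False},
    {"name": "shell", "kernel_thread": False,
     "state": "blocked", "restartable": True},
]
print(quiesce_blockers(tasks))  # ['kworker/0:1', 'logger']
```

Shrinking that blocker list to zero is the hard part: kernel threads need their own patching protocol, and every non-restartable long-term wait in system calls and drivers would have to be reworked.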
An alternative would be to just go with the kGraft two-universe model, which does not depend on stack traces. The downside with this approach is that the process of trapping every process in a safe place can take an unbounded period of time during which the system is in a weird intermediate state. Yet another alternative is to do without the consistency model entirely. That would severely limit the range of patches that could be applied, but it seems that most security fixes (involving, say, the addition of a simple range check) could still be applied to a running system.
Live kernel upgrades
Perhaps feeling that he had not stirred the anthill sufficiently, Ingo went on to propose giving up on both kpatch and kGraft, saying "I think they are fundamentally misguided in both implementation and in design, which turns them into an (unwilling) extended arm of the security theater". Rather than trying to patch a running kernel, he suggested, why not just save the entire state of the system, boot into an entirely new kernel, then restore the previous state on top of the new kernel? That would get rid of consistency models, greatly expand the range of patches that can be applied, and, in theory, would be more robust.
This idea is not new, of course. The developers working on CRIU (checkpoint-restore in user space) have had seamless kernel upgrades in their list of use cases for a while, and they evidently have it working for some workloads. But making this functionality work robustly on all systems would require a great deal of extra work to snapshot the full system state (including the state of devices) and restore it all under an arbitrarily different kernel. Vojtech Pavlik, one of the developers behind kGraft, estimated that it would take ten years to make such a system work.
The users asking for live patching, it is safe to say, would not be thrilled about the prospect of waiting that long. It is also far from clear that the full-upgrade technique, once it actually works, can ever be fast enough to keep those users happy. Ingo estimated that a live upgrade could complete within ten seconds, but that is an eternity to users who find even subsecond stalls for patching to be overly disruptive. So, while there is widespread agreement that live upgrades are an interesting and possibly useful technology, there is little chance that any of the developers currently working on live patching will decide to refocus their efforts on live upgrades.
So work on live patching will continue, but it is not clear what direction that work will take. The hopes of getting the consistency-model code ready for the 4.1 merge window now seem somewhat remote; getting consensus on a design that can be merged could take some time. So, while it is still possible that the kernel will have an essentially complete live-patching feature by the end of the year, it may happen rather closer to the end of the year than the developers involved might have hoped for.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
