Kernel development
Brief items
Kernel release status
The current development kernel is 3.14-rc5, released on March 2. Linus says: "Not a lot. Which is just how I like it. Go verify that it all works for you."
Stable updates: no stable updates have been released in the last week. The 3.13.6 and 3.10.33 updates are in the review process as of this writing; they can be expected on or after March 6.
Quotes of the week
Red Hat's dynamic kernel patching project
It seems that Red Hat, too, has a project working on patching running kernels. "kpatch allows you to patch a Linux kernel without rebooting or restarting any processes. This enables sysadmins to apply critical security patches to the kernel immediately, without having to wait for long-running tasks to complete, users to log off, or scheduled reboot windows. It gives more control over uptime without sacrificing security or stability." It looks closer to ksplice than to SUSE's kGraft in that it patches out entire functions at a time.
SUSE Labs Director Talks Live Kernel Patching with kGraft (Linux.com)
Libby Clark talks with Vojtech Pavlik, Director of SUSE Labs, about kGraft. "In this Q&A, Pavlik goes into more detail on SUSE's live kernel patching project; how the kGraft patch integrates with the Linux kernel; how it compares with other live-patching solutions; how developers will be able to use the upcoming release; and the project's interaction with the kernel community for upstream acceptance."
Broadcom releases SoC graphics driver source
Broadcom has announced the release of the source and documentation for its VideoCore IV graphics subsystem. This subsystem is found in the Raspberry Pi processor, among others. "The trend over the last decade has leaned towards greater openness in desktop graphics, and the same is happening in the mobile space. Broadcom — a long-time leader in graphics processors — is a frontrunner in this movement and aims to contribute to its momentum."
Kernel development news
Finding the proper scope of a file collapse operation
System call design is never easy; there are often surprising edge cases that developers fail to consider as they settle on an interface. System calls involving filesystems seem to be especially prone to this kind of problem, since the complexity and variety of filesystem implementations means that there may be any number of surprises waiting for a developer who wants to create a new file-oriented operation. Some of these surprises can be seen in the discussion of a proposed addition to the fallocate() system call.
fallocate() is concerned with the allocation of space within a file; its initial purpose was to allow an application to allocate blocks to a file prior to writing them. This type of preallocation ensures that the needed space is available before trying to write the data that goes there; it can also help filesystem implementations lay out the allocated space more efficiently on disk. Later on, the FALLOC_FL_PUNCH_HOLE operation was added to deallocate blocks within a file, leaving a hole in the file.
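By way of illustration (this example is ours, not part of the discussion), a minimal and minimally error-checked use of the existing operations might look like this; the file name and sizes are arbitrary:

    /* Illustration only: preallocate blocks, then punch a hole in the middle.
     * FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            exit(1);

        /* Ensure blocks are allocated for the first 16MB before writing. */
        if (fallocate(fd, 0, 0, 16 << 20))
            exit(1);

        /* Deallocate 1MB starting at the 4MB mark, leaving a hole. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      4 << 20, 1 << 20))
            exit(1);

        close(fd);
        return 0;
    }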
In February, Namjae Jeon proposed a new fallocate() operation called FALLOC_FL_COLLAPSE_RANGE; this proposal included implementations for the ext4 and xfs filesystems. Like the hole-punching operation, it removes data from a file, but there is a difference: rather than leaving a hole in the file, this operation moves all data beyond the affected range to the beginning of that range, shortening the file as a whole. The immediate user for this operation would appear to be video editing applications, which could use it to quickly and efficiently remove a segment of a video file. If the removed range is block-aligned (which would be a requirement, at least for some filesystems), the removal could be effected by changing the file's extent maps, with no actual copying of data required. Given that files containing video data can be large, it is not hard to understand why an efficient "cut" operation would be attractive.
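Should the operation go in as proposed, a video editor's "cut" might look something like the following sketch. It is hypothetical: cut_segment() is a made-up helper, and the flag's numeric value is taken from the proposed patches and could change along with the rest of the interface.

    /* Hypothetical sketch of a "cut" using the proposed
     * FALLOC_FL_COLLAPSE_RANGE; offset and len must be block-aligned. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    #ifndef FALLOC_FL_COLLAPSE_RANGE
    #define FALLOC_FL_COLLAPSE_RANGE 0x08   /* value per the proposed patches */
    #endif

    int cut_segment(int fd, off_t offset, off_t len)
    {
        /* Everything beyond offset+len slides down to offset and the file
         * shrinks by len; if the range is block-aligned, no data need be
         * copied, only the extent maps changed. */
        if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)) {
            perror("FALLOC_FL_COLLAPSE_RANGE");
            return -1;
        }
        return 0;
    }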
So what kinds of questions arise with an operation like this? One could start with the interaction with the mmap() system call, which maps a file into a process's address space. The proposed implementation works by removing all pages from the affected range to the end of the file from the page cache; dirty pages are written back to disk first. That will prevent the immediate loss of data that may have been written via a mapping, and will get rid of any memory pages that will be after the end of the file once the operation is complete. But it could be a surprise for a process that does not expect the contents of a file to shift around underneath its mapping. That is not expected to be a huge problem; as Dave Chinner pointed out, the types of applications that would use the collapse operation do not generally access their files via mmap(). Beyond that, applications that are surprised by a collapsed file may well be unable to deal with other modifications even in the absence of a collapse operation.
But, as Hugh Dickins noted, there is a related problem: in the tmpfs filesystem, all files live in the page cache and look a lot like a memory mapping. Since the page cache is the backing store, removing file pages from the page cache is unlikely to lead to a happy ending. So, before tmpfs could support the collapse operation, a lot more effort would have to go into making things play well with the page cache. Hugh was not sure that there would ever be a need for this operation in tmpfs, but, he said, solving the page cache issues for tmpfs would likely lead to a more robust implementation for other filesystems as well.
Hugh also wondered whether the uni-directional collapse operation should, instead, be designed to work in both directions:
Andrew Morton went a little further, suggesting that a simple "move these blocks from here to there" system call might be the best idea. But Dave took a dim view of that suggestion, worrying that it would introduce a great deal of complexity and difficult corner cases:
Andrew disagreed, claiming that a more general interface was preferable and that the problems could be overcome, but nobody else supported him on this point. So, chances are, the operation will remain confined to collapsing chunks out of files; a separate "insert" operation may be added in the future, should an interesting use case for it be found.
Meanwhile, there is one other behavioral question to answer; what happens if the region to be removed from the file reaches to the end of the file? The current patch set returns EINVAL in that situation, with the idea that a call to truncate() should be used instead. Ted Ts'o asked whether such operations should just be turned directly into truncate() calls, but Dave is set against that idea. A collapse operation that includes the end of the file, he said, is almost certainly buggy; it is better to return an error in that case.
There are also, evidently, some interesting security issues that could come up if a collapse operation were allowed to include the end of the file. Filesystems can allocate blocks beyond the end of the file; indeed, fallocate() can be used to explicitly request that behavior. Those blocks are typically not zeroed out by the filesystem; instead, they are kept inaccessible so that whatever stale data is contained there cannot be read. Without a great deal of care, a collapse implementation that allowed the range to go beyond the end of the file could end up exposing that data, especially if the operation were to be interrupted (by a system crash, perhaps) in the middle. Rather than set that sort of trap for filesystem developers, Dave would prefer to disallow the risky operations from the beginning, especially since there does not appear to be any real need to support them.
So the end result of all this discussion is that the FALLOC_FL_COLLAPSE_RANGE operation is likely to go into the kernel essentially unchanged. It will not have all the capabilities that some developers would have liked to see, but it will support one useful feature that should help to accelerate a useful class of applications. Whether this will be enough for the long term remains to be seen; system call API design is hard. But, should additional features be needed in the future, new FALLOC_FL commands can be created to make them available in a compatible way.
Tracing unsigned modules
The reuse of one of the "tainted kernel" flags by the signed-module loading code has led to a problem using tracepoints in unsigned kernel modules. The problem is fairly easily fixed, but there was opposition to doing so, at least until a "valid" use case could be found. Kernel hackers are not particularly interested in helping out-of-tree modules, and fixing the problem was seen that way—at first, anyway.
Loadable kernel modules have been part of the kernel landscape for nearly 20 years (kernel 1.2 in 1995), but have only recently gained the ability to be verified by a cryptographic signature, so that only "approved" modules can be loaded. Red Hat kernels have had the feature for some time, though it was implemented differently than what eventually ended up in the kernel. Basically, the kernel builder can specify a key to be used to sign modules; the private key gets stored (or discarded after signing), while the public key is built into the kernel.
There are several kernel configuration parameters that govern module signing: CONFIG_MODULE_SIG controls whether the code to do signature checking is enabled at all, while CONFIG_MODULE_SIG_FORCE determines whether all modules must be signed. If CONFIG_MODULE_SIG_FORCE is not turned on (and the corresponding kernel boot parameter module.sig_enforce is not present), then modules without signatures or those using keys not available to the kernel will still be loaded. In that case, though, the kernel will be marked as tainted.
The taint flag used, though, is the same as that used when the user does a modprobe --force to force the loading of a module built for a different kernel version: TAINT_FORCED_MODULE. Force-loading a module is fairly dangerous; incompatibilities between the module's view of memory layout and the kernel's can lead to crashes that the kernel developers are not interested in spending time on. So, force-loading a module taints the kernel to allow those bug reports to be quickly skipped over.
But loading an unsigned module is not likely to lead to a kernel crash (or at least, not because it is unsigned), so using the TAINT_FORCED_MODULE flag in that case is not particularly fair. The tracepoint code discriminates against force-loaded modules because enabling tracepoints in mismatched modules could easily lead to a system crash. The tracepoint code does allow TAINT_CRAP (modules built from the staging tree) and TAINT_OOT_MODULE (out-of-tree modules) specifically, but tracepoints in the modules get silently disabled if there is any other taint flag.
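In simplified form, that check looks roughly like the following (a sketch of the logic, not the verbatim tracepoint code):

    /* Sketch of the test in the tracepoint code's module notifier: staging
     * (TAINT_CRAP) and out-of-tree (TAINT_OOT_MODULE) modules are tolerated;
     * any other taint flag silently disables the module's tracepoints. */
    if (mod->taints & ~((1 << TAINT_OOT_MODULE) | (1 << TAINT_CRAP)))
            return 0;       /* skip tracepoint registration for this module */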
Mathieu Desnoyers posted an RFC patch to change that situation. It added a new TAINT_UNSIGNED_MODULE flag that got set when those modules were loaded. It also changed the test in the tracepoint code to allow tracing for the new taint type. It drew an immediate NAK from Ingo Molnar, who did not find Desnoyers's use case at all compelling: "External modules should strive to get out of the 'crap' and 'felony law breaker' categories and we should not make it easier for them to linger in a broken state."
But the situation is not as simple as Molnar seems to think. There are distribution kernels that turn on signature checking, but allow users to decide whether to require signatures by using module.sig_enforce. Since it is the distribution's key that is stored in the kernel image, strict enforcement would mean that only modules built by the distribution could be loaded. That leaves out a wide variety of modules that a user might want to load: modules under development, back-ported modules from later kernels, existing modules being debugged, and so on.
Module maintainer Rusty Russell was fairly unimpressed with the arguments given in favor of the change, at least at first, noting that the kernel configuration made it clear that users needed to arrange to sign their own modules: "Then you didn't do that. You broke it, you get to keep both pieces." That's not exactly the case, though, since CONFIG_MODULE_SIG_FORCE was not set (nor was module.sig_enforce passed on the kernel command line), so users aren't actually required to arrange for signing. But Russell was looking for "an actual valid use case".
The problem essentially boils down to the fact that the kernel is lying when it uses TAINT_FORCED_MODULE for a module that, in fact, hasn't been forced. Steven Rostedt tried to make that clear: "Why the hell are we setting a FORCED_MODULE flag when no module was forced????" He also noted that he is often the one to get bug reports from folks whose tracepoints aren't showing up because they didn't sign their module. As he pointed out, it is a silent failure and the linkage between signed modules and tracepoints is not particularly obvious.
Johannes Berg was eventually able to supply the kind of use case Russell was looking for, though. In his message, he summarized the case for unsigned modules nicely:
Berg also provided another reason for loading unsigned modules: backported kernel modules from the kernel.org wiki to support hardware (presumably, in his case, wireless network hardware) features not present in the distribution-supplied drivers. He was quite unhappy to hear those kinds of drivers, which "typically only diverge from upstream by a few patches", characterized as crap or law-breaking.
Berg's use case was enough for Russell to agree to the change and to add it to his pending tree. We should see Desnoyers's final patch, which has some cosmetic changes from the RFC, in 3.15. At that point, the kernel will be able to distinguish between these two different kinds of taint and users will be able to trace modules they have loaded, signed or unsigned.
Optimizing VMA caching
The kernel divides each process's address space into virtual memory areas (VMAs), each of which describes where the associated range of addresses has its backing store, its protections, and more. A mapping created by mmap(), for example, will be represented by a single VMA, while mapping an executable file into memory may require several VMAs; the list of VMAs for any process can be seen by looking at /proc/PID/maps. Finding the VMA associated with a specific virtual address is a common operation in the memory management subsystem; it must be done for every page fault, for example. It is thus not surprising that this mapping is highly optimized; what may be surprising is the fact that it can be optimized further.
The VMAs for each address space are stored in a red-black tree, which enables a specific VMA to be looked up in logarithmic time. These trees scale well, which is important; some processes can have hundreds of VMAs (or more) to sort through. But it still takes time to walk down to a leaf in a red-black tree; it would be nice to avoid that work at least occasionally if it were possible. Current kernels work toward that goal by caching the results of the last VMA lookup in each address space. For workloads with any sort of locality, this simple cache can be quite effective, with hit rates of 50% or more.
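That existing cache is about as simple as such things get; in current kernels, find_vma() does something along the following lines (a simplified sketch, not the mainline code verbatim; rb_tree_lookup() here stands in for the actual red-black tree walk):

    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            /* One cached entry per address space: check it before the tree. */
            vma = ACCESS_ONCE(mm->mmap_cache);
            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                    return vma;

            vma = rb_tree_lookup(mm, addr); /* placeholder for the tree walk */
            if (vma)
                    mm->mmap_cache = vma;   /* remember the result for next time */
            return vma;
    }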
But Davidlohr Bueso thought it should be possible to do better. Last November, he posted a patch adding a second cache holding a pointer to the largest VMA in each address space. The logic was that the VMA with the most addresses would see the most lookups, and his results seemed to bear that out; with the largest-VMA cache in place, hit rates went to over 60% for some workloads. It was a good improvement, but the patch did not make it into the mainline. Looking at the discussion, one can quickly come up with a useful tip for aspiring kernel developers: if Linus responds by saying "This patch makes me angry", the chances of it being merged are relatively small.
Linus's complaint was that caching the largest VMA seemed "way too ad-hoc" and wouldn't be suitable for a lot of workloads. He suggested caching a small number of recently used VMAs instead. Additionally, he noted that maintaining a single cache per address space, as current kernels do, might not be a good idea. In situations where multiple threads are running in the same address space, it is likely that each thread will be working with a different set of VMAs. So making the cache per-thread, he said, might yield much better results.
A few iterations later, Davidlohr has posted a VMA-caching patch set that appears to be about ready to go upstream.
Following Linus's suggestion, the single-VMA cache (mmap_cache in struct mm_struct) has been replaced by a small array called vmacache in struct task_struct, making it per-thread. On systems with a memory management unit (almost all systems), that array holds four entries. There are also new sequence numbers stored in both struct mm_struct (one per address space) and in struct task_struct (one per thread).
The purpose of the sequence numbers is to ensure that the cache does not return stale results. Any change to the address space (the addition or removal of a VMA, for example) causes the per-address-space sequence number to be incremented. Every attempt to look up an address in the per-thread cache first checks the sequence numbers; if they do not match, the cache is deemed to be invalid and will be reset. Address-space changes are relatively rare in most workloads, so the invalidation of the cache should not happen too often.
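In rough terms, the mechanism looks like this (an illustrative sketch; the field and function names follow the patch set, but the declarations are simplified and this is not the patch itself):

    #define VMACACHE_SIZE 4                 /* per-thread entries on MMU systems */

    struct mm_struct {
        /* ... */
        u32 vmacache_seqnum;                /* per-address-space count */
    };

    struct task_struct {
        /* ... */
        u32 vmacache_seqnum;                /* mm's count at this thread's last sync */
        struct vm_area_struct *vmacache[VMACACHE_SIZE];
    };

    /* Called whenever a VMA is added, removed, or resized. */
    static inline void vmacache_invalidate(struct mm_struct *mm)
    {
        mm->vmacache_seqnum++;
    }

    /* Checked at the start of every per-thread cache lookup. */
    static bool vmacache_valid(struct mm_struct *mm)
    {
        if (current->vmacache_seqnum != mm->vmacache_seqnum) {
            /* The address space changed since this thread last looked:
             * resynchronize and throw away the cached entries. */
            current->vmacache_seqnum = mm->vmacache_seqnum;
            memset(current->vmacache, 0, sizeof(current->vmacache));
            return false;
        }
        return true;
    }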
Every call to find_vma() (the function that locates the VMA for a virtual address) first does a linear search through the cache to see if the needed VMA is there. Should the VMA be found, the work is done; otherwise, a traversal of the red-black tree will be required. In this case, the result of the lookup will be stored back into the cache. That is done by overwriting the entry indexed by the lowest bits of the page-frame number associated with the original virtual address. It is, thus, a random replacement policy for all practical purposes. The caching mechanism is meant to be fast so there would probably be no benefit from trying to implement a more elaborate replacement policy.
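The lookup and replacement logic, again in sketch form and under the same naming assumptions as above, might read:

    /* Illustrative sketch of the per-thread lookup and its replacement
     * policy: the victim slot comes from the low bits of the page-frame
     * number, which is effectively random replacement. */
    #define VMACACHE_HASH(addr) (((addr) >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))

    static struct vm_area_struct *vmacache_find(struct mm_struct *mm,
                                                unsigned long addr)
    {
        int i;

        if (!vmacache_valid(mm))
            return NULL;                        /* cache was just reset */

        for (i = 0; i < VMACACHE_SIZE; i++) {   /* short linear scan */
            struct vm_area_struct *vma = current->vmacache[i];

            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                return vma;
        }
        return NULL;                            /* caller walks the rbtree */
    }

    /* Called after a successful tree lookup to remember the result. */
    static void vmacache_update(unsigned long addr, struct vm_area_struct *vma)
    {
        current->vmacache[VMACACHE_HASH(addr)] = vma;
    }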
How well does the new scheme work? It depends on the workload, of course. For system boot, where almost everything running is single-threaded, Davidlohr reports that the cache hit rate went from 51% to 73%. Kernel builds, unsurprisingly, already work quite well with the current scheme with a hit rate of 75%, but, even in this case, improvement is possible: that rate goes to 88% with Davidlohr's patch applied. The real benefit, though, can be seen with benchmarks like ebizzy, which is designed to simulate a multithreaded web server workload. Current kernels find a cached VMA in a mere 1% of lookup attempts; patched kernels, instead, show a 99.97% hit rate.
With numbers like that, it is hard to find arguments for keeping this patch out of the mainline. At this point, the stream of suggestions and comments has come to a halt. Barring surprises, a new VMA lookup caching mechanism seems likely to find its way into the 3.15 kernel.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet
