Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc3, released on October 28. Linus notes that it's mostly a lot of small changes in a lot of places. But he has found a new problem to be concerned about: "And talking about the shortlog: christ people, some of you need to change your names. I'm used to there being multiple 'David's and 'Peter's etc, but there are three different Linus's in just this rc. People, people, I want to feel like the unique snowflake I am, not like just another anonymous guy in a crowd."

Stable updates: 3.0.49, 3.4.16, and 3.6.4 all came out on October 28; they were followed by 3.0.50, 3.2.33, 3.4.17, and 3.6.5 on October 31. All contain another set of important fixes. Worth noting: in response to another reported regression, 3.6.5 disables by default the hard and soft link security restrictions that were added during the 3.6 merge window.

Quotes of the week

And the next technology journalist that asks you whether you want fonts that small, I'll just hunt down and give an atomic wedgie.
Linus Torvalds doesn't do blocking wedgies

And suddenly causing a complete cessation of vm scanning at a particular magic threshold seems rather crude, compared to some complex graduated thing which will also always do the wrong thing, only more obscurely ;)
Andrew Morton

You will get this message once a day until you've dealt with these bugs!
bugzilla@kernel.org failing to win friends and influence developers

Kroah-Hartman: Help wanted

Greg Kroah-Hartman is looking for somebody to help him put stable kernels together. "I'm looking for someone to help me out with the stable Linux kernel release process. Right now I'm drowning in trees and patches, and could use some one to help me sanity-check the releases I'm doing."

Airlie: raspberry pi drivers are NOT useful

Kernel graphics maintainer Dave Airlie is rather unimpressed with the Raspberry Pi driver release; it is not something that will ever be merged. "Why is this bad? You cannot make any improvements to their GLES implementation, you cannot add any new extensions, you can't fix any bugs, you can't do anything with it. You can't write a Mesa/Gallium driver for it. In other words you just can't."

Kernel development news

A potential NUMA scheduling solution

By Jonathan Corbet
October 31, 2012
Earlier this year, two different developers set out to create a solution to the problem of performance (or the lack thereof) on non-uniform memory access (NUMA) systems. The Linux kernel's scheduler will freely move processes around to maximize CPU utilization on large systems; unfortunately, on NUMA systems, that can lead to processes being separated from their memory, reducing performance considerably. Two very different solutions to the problem were posted, leaving no clear path toward a single solution that could be merged into the mainline. Now, perhaps, that single solution exists, but the way that solution came about raises some questions.

The first approach was Peter Zijlstra's sched/numa patch set. It added a "lazy migration" mechanism (implemented by Lee Schermerhorn) that uses soft page faults to move useful pages to the NUMA node where they were actually being used. On top of that, it implemented a new "home node" concept that keeps the scheduler from moving processes between NUMA nodes whenever possible; it also tries to make memory allocations happen on the allocating process's home node. Finally, there was a pair of system calls allowing a process to change its home node and to form groups of processes that should all run on the same home node.

Andrea Arcangeli's AutoNUMA patch set, instead, was more strongly focused on migrating pages to the nodes where they are actually being used. To that end, it created a tracking mechanism (again, using page faults) to figure out where page accesses were coming from; there was a new kernel thread to perform this tracking. Whenever the generated statistics revealed that too many pages were being accessed from remote nodes, the kernel would consider either relocating the processes performing those accesses or relocating the pages; either way, the goal was to get both the processes and the pages on the same node.

To say that the two developers disagreed on the right solution is to understate the case considerably. Peter claimed that AutoNUMA abused the scheduler, added too much memory overhead, and slowed scheduling decisions unacceptably. Andrea responded that sched/numa would not work well, especially for larger jobs, without manual tweaking by developers and/or system administrators. The conversation was rather less than polite at times — until it went silent altogether. Peter last responded to the AutoNUMA discussion at the end of June — this example demonstrates the level of the discussion at that time — and the last sched/numa posting happened at the end of July.

The silence ended on October 25 with Peter's posting of the numa/core patch set. The patch introduction reads:

Here's a re-post of the NUMA scheduling and migration improvement patches that we are working on. These include techniques from AutoNUMA and the sched/numa tree and form a unified basis - it has got all the bits that look good and mergeable....

These patches will continue their life in tip:numa/core and unless there are major showstoppers they are intended for the v3.8 merge window. We believe that they provide a solid basis for future work.

It is worth noting that the value of "we" is not well defined anywhere in the patch set.

Numa/core brings in much of the sched/numa patch set, including the lazy migration scheme, the memory policy changes, and the home node concept. The core scheduler change tries to keep processes on their home node by adding resistance to moving a process away from that node, and by trying to push misplaced processes back to the home node during load balancing. There is also a feature to wake sleeping processes on the home node regardless of where they were running before, but it is disabled because "we found this to be far too aggressive." Missing from this patch set are the proposed numa_tbind() and numa_mbind() system calls; it's not clear whether those are meant to be added later.
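The home-node bias described above can be illustrated with a small model. What follows is a sketch only, not code from the patch set: the Task class, the should_migrate() helper, and the HOME_NODE_RESISTANCE value are all hypothetical, and the real scheduler weighs far more than a simple count of runnable tasks.

```python
from dataclasses import dataclass

# Hypothetical extra imbalance required before a task may leave its home
# node; this is an illustrative value, not a kernel constant.
HOME_NODE_RESISTANCE = 2


@dataclass
class Task:
    name: str
    home_node: int
    current_node: int


def should_migrate(task, dest_node, load):
    """Decide whether a balancing pass should move task to dest_node.

    load maps a node ID to the number of runnable tasks on that node.
    """
    src_node = task.current_node
    if dest_node == src_node:
        return False
    imbalance = load[src_node] - load[dest_node]
    if dest_node == task.home_node:
        # Misplaced task: push it back home unless the home node is
        # already busier than the node the task would be leaving.
        return imbalance >= 0
    if src_node == task.home_node:
        # Leaving home must be justified by a larger imbalance.
        return imbalance > HOME_NODE_RESISTANCE
    return imbalance > 0
```

In this model, a task sitting on its home node stays put through small imbalances, while a misplaced task returns home as soon as doing so would not leave the home node more heavily loaded than the node it departs.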

The patch set also includes some ideas from AutoNUMA. The page structure gains a new last_nid field to record the ID of the NUMA node last observed to access the page. That new field will cause struct page to grow on 32-bit systems, which is never a popular thing to do. It is expected, though, that most systems where better NUMA scheduling really matters will be 64-bit.
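As a rough illustration of how a field like last_nid might be used, consider the toy model below. The Page class and handle_hinting_fault() are hypothetical, and the "same node twice in a row" migration rule is an assumption made for the sake of the example; the article does not say what policy the patch set actually applies.

```python
from dataclasses import dataclass


@dataclass
class Page:
    node: int            # NUMA node whose memory currently holds the page
    last_nid: int = -1   # node last observed to access the page; -1 = none


def handle_hinting_fault(page, accessing_nid):
    """Record the accessing node; return True if the page was migrated.

    Assumed policy: migrate only when the same remote node touches the
    page on two consecutive faults, so one stray access moves nothing.
    """
    repeated = page.last_nid == accessing_nid
    page.last_nid = accessing_nid
    if repeated and page.node != accessing_nid:
        page.node = accessing_nid   # stand-in for the real page migration
        return True
    return False
```

A single integer per page is enough for this kind of filtering, which gives a sense of why the field is cheap on 64-bit systems and unwelcome on 32-bit ones.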

Scanning of memory is still done: pages are marked as being absent so that usage patterns can be observed from the resulting soft faults. But the kernel thread to perform this scanning no longer exists; it is, instead, done by each process in its own context. The number of pages scanned is proportional to each process's run time, so little effort is put into the scanning of pages belonging to processes that rarely run. Scanning does not start until a given process has accumulated at least one second of run time. It makes sense that there is little value in optimizing the NUMA placement of short-lived processes; in this case, that intuition was confirmed with an improvement in the all-important kernel-compilation benchmark. Most of the memory overhead added by the original AutoNUMA patches has been removed.
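The two rules described here (no scanning before one second of accumulated run time, and scanning effort proportional to run time thereafter) can be sketched as follows. The one-second threshold comes from the description above; the PAGES_PER_MS rate and the pages_to_scan() helper are hypothetical and chosen only for illustration.

```python
SCAN_START_DELAY_NS = 1_000_000_000   # one second of accumulated run time
PAGES_PER_MS = 4                      # hypothetical rate, for illustration only


def pages_to_scan(total_runtime_ns, runtime_since_last_scan_ns):
    """Return how many pages a task should mark for fault tracking now."""
    if total_runtime_ns < SCAN_START_DELAY_NS:
        # Short-lived tasks are never scanned at all.
        return 0
    # Work is proportional to the time the task actually ran, so a task
    # that rarely runs contributes almost no scanning overhead.
    return (runtime_since_last_scan_ns // 1_000_000) * PAGES_PER_MS
```

Since the scanning runs in the context of the process being scanned, its cost is charged to that process rather than to a separate kernel thread.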

Thus far, there has been little in the way of reviews of this large patch set, and no benchmark results posted. Things will have to pick up on that front if a patch set of this size is going to be ready by the time the 3.8 merge window opens. The numa/core patches may improve NUMA scheduling, and they may be the right basis to move forward with, but the development community as a whole does not know that yet.

There is one other thing that jumps out at an attentive observer. These patches credit Andrea's work with a set of Suggested-by and Based-on-idea-by tags, but none of them are signed off by Andrea. It would appear that, while some of his ideas have found their way into this patch set, his code has not. But, despite the fact that he did not write this code, Andrea has been conspicuously absent from the review discussion.

In the absence of any further information, it is hard not to conclude that Andrea has removed himself from this particular project. Certainly Red Hat cannot be faulted if it is unable to feel entirely comfortable when some of its highest-profile engineers are fighting among themselves in a public forum. So it is not hard to imagine that the developers involved were given clear instructions to resolve the situation. If that were the case, we would have a solution that was arrived at as much by Red Hat management as by the wider development community.

Such speculation (and it certainly is no more than that), of course, says nothing about the quality of the current patch set. That will be judged by the development community, presumably between now and when the 3.8 merge window opens. Assuming the patches pass this review, we should have an improved NUMA scheduler and an end to an ongoing dispute. As the number of NUMA (and NUMA-like) systems grows, that can only be a good thing.

Relocating RCU callbacks

By Jonathan Corbet
October 31, 2012
The read-copy-update (RCU) subsystem is one of the kernel's key scalability mechanisms; it is usually invoked in situations where normal locking is far too slow. RCU is known to be complex code, to the point that lesser kernel developers will happily proclaim that they do not understand it. That should not be taken to mean that RCU cannot be made faster or more complex, though. Paul McKenney's "callback-free CPUs" patch set is a case in point.

Much RCU processing has traditionally been done in software interrupt (softirq) context, meaning that the actual processing is done at seemingly random times during the execution of whatever process happens to have the CPU at the time. Softirqs thus have the potential to add arbitrary delays to the execution of any process, regardless of that process's priority. It is not surprising that the realtime developers have been working on the softirq problem; non-realtime developers, too, have been known to grumble about softirq overhead. Depending on the load on the system, RCU processing can be a significant part of the overall softirq workload. So improvements in RCU processing can help eliminate unwanted latencies and jitter even if software interrupt handling as a whole remains unchanged.

Paul recently described some work in that direction on this page; as of the 3.6 kernel, much of the RCU grace period handling has been moved to kernel threads. RCU works by replacing a data structure with a modified version, retaining the old copy but hiding it from view so that no new references to it will be created. The RCU rules guarantee that any data structure made inaccessible in this way before a "grace period" passes will have no outstanding references after that period; the determination of grace periods is thus a crucial step in the cleanup and deletion of those old data structures. It turns out that identifying grace periods in a scalable and efficient manner is not a trivial task; see, for example, this article for details.
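For readers who have not looked at RCU before, the bookkeeping behind "replace, hide, wait, then free" can be modeled in a few lines. This is a single-threaded toy, not how the kernel does it: the ToyRCU class and its generation counter are hypothetical, and the real implementation tracks quiescent states per CPU rather than keeping a table of readers.

```python
class ToyRCU:
    """Single-threaded model of RCU publish-and-retire bookkeeping."""

    def __init__(self, value):
        self.current = value      # the version that new readers will see
        self.generation = 0       # bumped on every update
        self.readers = {}         # reader ID -> generation at entry
        self.retired = []         # (generation when retired, old value)
        self.freed = []           # old versions that have been cleaned up

    def read_lock(self, reader_id):
        self.readers[reader_id] = self.generation
        return self.current

    def read_unlock(self, reader_id):
        del self.readers[reader_id]
        self._reclaim()

    def update(self, new_value):
        old = self.current
        self.current = new_value  # publish: new readers see only new_value
        self.generation += 1
        self.retired.append((self.generation, old))
        self._reclaim()

    def _reclaim(self):
        # A retired version's grace period has passed once no active
        # reader entered its critical section before the retirement.
        oldest = min(self.readers.values(), default=self.generation)
        waiting = []
        for gen, value in self.retired:
            if oldest >= gen:
                self.freed.append(value)
            else:
                waiting.append((gen, value))
        self.retired = waiting
```

A reader that entered before an update keeps its reference to the old version; that version lands in the freed list only after the reader leaves, which is the guarantee that a grace period provides.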

Moving grace period handling to kernel threads takes a certain amount of RCU overhead out of the softirq path, reducing jitter and allowing that handling to be assigned priorities like any other process. But, even with grace period processing out of the way, RCU still has a fair amount of work to do in softirq context. Near the top of the list is the calling of RCU callbacks — the functions that actually perform cleanup work after a grace period passes. With some workloads, the number of callbacks can get quite large. Users concerned about jitter have expressed a desire to move as much kernel processing out of the way as possible; RCU callback processing represents a significant chunk of that work.

That is the motivation for Paul's callback-free CPUs patch set. The idea is simple enough: rather than invoke RCU callbacks in softirq context, the kernel can just shunt that work off to yet another kernel thread. The implementation, of course, is just a bit more involved than that.

The patch set adds a new rcu_nocbs= boot-time parameter allowing the system administrator to specify a set of CPUs to run in the "no callbacks" mode. It is not possible to do so with every CPU in the system; at least one processor must remain in the traditional mode or grace period processing will not function properly. In practical terms, that means that CPU0 cannot be run in the no-callbacks mode and any attempt to hot-remove the last traditional-RCU CPU will fail.
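The constraints on the CPU set can be expressed as a short validation routine. This sketch assumes that the parameter takes a list in the familiar "1-3,5" style, which is not stated above; parse_cpu_list() and validate_nocb_cpus() are hypothetical helpers, not kernel functions.

```python
def parse_cpu_list(spec):
    """Parse a list such as "1-3,5" into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def validate_nocb_cpus(spec, nr_cpus):
    """Return the no-callbacks CPUs, rejecting sets that cannot work."""
    cpus = parse_cpu_list(spec)
    if 0 in cpus:
        # At least one CPU must stay in the traditional mode; in
        # practice that CPU is CPU0.
        raise ValueError("CPU0 cannot run in no-callbacks mode")
    if any(cpu >= nr_cpus for cpu in cpus):
        raise ValueError("CPU number out of range")
    return cpus
```

On an eight-CPU system, validate_nocb_cpus("1-3,5", 8) accepts the request, while any list that includes CPU0 is refused.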

When a CPU (call it CPUN) runs without RCU callbacks, there will be a separate rcuoN process charged with callback handling. When that process wakes up, it will grab the list of outstanding callbacks for its assigned CPU, using some tricky atomic-exchange techniques to avoid the need for explicit locking. The thread will wait for the grace period to expire, then run through the callbacks; after that the cycle begins anew. Normally the process wakes up when callbacks are added to an empty list, but a separate boot parameter instructs the threads to poll occasionally for new work instead. Polling has its costs, especially on systems where energy efficiency and letting CPUs sleep are priorities, but it can improve RCU's CPU efficiency, helping throughput.
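The cycle followed by each callback thread (grab the list, wait for a grace period, invoke the callbacks) can be modeled as shown below. This is a sketch under stated assumptions: NoCbQueue and worker_pass() are hypothetical names, and a lock stands in for the atomic-exchange techniques mentioned above, which Python cannot express directly.

```python
import threading


class NoCbQueue:
    """Per-CPU callback list drained by a hypothetical rcuo-style worker."""

    def __init__(self):
        self._pending = []
        self._swap_lock = threading.Lock()   # stands in for an atomic exchange

    def call_rcu(self, callback):
        """Queue a callback; return True if the list had been empty.

        A True return is the point at which the worker would be woken.
        """
        with self._swap_lock:
            was_empty = not self._pending
            self._pending.append(callback)
        return was_empty

    def take_all(self):
        """Detach the whole list in one step, leaving an empty one behind."""
        with self._swap_lock:
            batch, self._pending = self._pending, []
        return batch


def worker_pass(queue, wait_for_grace_period):
    """One cycle of the worker; return the number of callbacks invoked."""
    batch = queue.take_all()
    if batch:
        wait_for_grace_period()
        for callback in batch:
            callback()
    return len(batch)
```

Detaching the entire list at once means that callbacks queued while the worker waits for the grace period simply accumulate on the fresh list for the next pass.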

Users who are so sensitive to jitter that they want to reconfigure RCU callback processing may not be satisfied just by having that processing move to a thread that competes with their workload. The good news for those users is that, once callback processing lives in its own thread, it can be assigned a priority that fits with the overall goals of the system. Perhaps even better, the callback thread does not have to run on the CPU whose callbacks it is handling; by playing with CPU affinities, administrators can move that work to other CPUs, freeing the no-callback CPUs to focus more exclusively on the user's workload.

No-callback CPUs are thus part of the larger effort toward fully-dedicated CPUs that run nothing but the user's processes. The idea is that, on such a CPU, the workload would be fully in charge and need never worry that the kernel would get in the way when there is time-sensitive work to be done. Achieving that goal in a robust and maintainable manner is a rather larger task; it requires the NoHZ mechanism and more. It has been recognized for some time that this problem will need to be solved in smaller pieces; the no-callback CPUs patch is one of those pieces.

This patch set is in its second iteration; comments this time around have been scarce. Barring problems, it would not be surprising to see this feature pushed into the 3.8 kernel. Most users will not care, but, for those who obsess about latency and jitter, it should be a welcome addition.

Thoughts on the ext4 panic

By Jonathan Corbet
October 29, 2012
In just a few days, a linux-kernel mailing list report of ext4 filesystem corruption turned into a widely-distributed news story; the quality of ext4 and its maintenance, it seemed, was in doubt. Once the dust settled, the situation turned out to be rather less grave than some had thought; the bug in question only threatened a very small group of ext4 users using non-default mount options. As this is being written, a fix is in testing and should be making its way toward the mainline and stable kernels shortly. The bug was obscure, but there is value in looking at how it came about and the ripples it caused.

The timeline

On October 23, user "Nix" was trying to help track down an NFS lock manager crash when he ran into a little problem: the crash kept corrupting his filesystem, making the debugging task rather more difficult than it would otherwise have been. He reported the problem to the linux-kernel mailing list; he also posted a warning for other LWN readers. The ext4 developers moved quickly to find the problem, coming up with a hypothesis within a few hours of the initial report. Unfortunately, the hypothesis turned out to be wrong.

Before that became clear, though, a number of news outlets had posted articles on the problem. LWN was not the first to do so ("first" is not at the top of our list of priorities), but, late on the 24th, we, too, posted an item about the issue. It quickly became clear, though, that the original hypothesis did not hold water, and that further investigation was in order. That investigation, as it turns out, took a few days to play out.

Eric Sandeen eventually tracked the problem down to this commit, which found its way into the mainline during the 3.4 merge window. That change was meant to be a cleanup, gathering the inode allocation logic into a single function and removing some duplicated code. The unintended result was to cause the inode bitmap to be modified outside of a transaction, introducing unchecksummed data into the journal. If the system crashed during that time, the next mount would encounter checksum errors and refuse to play back the journal; the filesystem was then seen as being corrupt.
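The failure mode can be reproduced in miniature. The sketch below is not the ext4 journal format: the dictionary records, the use of CRC32, and the make_record() and replay() helpers are all assumptions made for illustration. It shows only the general principle that data changed after its checksum was computed will cause replay to be refused.

```python
import zlib


def make_record(payload):
    """Build a journal record carrying a CRC32 of its payload."""
    return {"payload": payload, "checksum": zlib.crc32(payload)}


def replay(journal):
    """Return the payloads to apply; refuse if any record fails its check."""
    for index, record in enumerate(journal):
        if zlib.crc32(record["payload"]) != record["checksum"]:
            raise ValueError(f"journal record {index}: checksum mismatch")
    return [record["payload"] for record in journal]


journal = [make_record(b"inode bitmap v1"), make_record(b"directory block")]

# Modifying the block outside of the transaction leaves the stored
# checksum describing data that is no longer there.
journal[0]["payload"] = b"inode bitmap v2"

try:
    replay(journal)
    replay_refused = False
except ValueError:
    replay_refused = True
```

With checksums disabled, there is nothing to compare against and replay proceeds, which is why most systems never saw the problem.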

The interesting thing is that, on most systems, this problem will never come about because, on those systems, the journal checksums do not actually exist. Journal checksumming is an optional feature, not enabled by default, and, evidently, not widely used. Nix had turned on the feature somewhat inadvertently; most other users do not turn it on at all, even if they are aware it exists. Anybody who has journal checksums turned off will not be affected by this bug, so very few ext4 users needed to be concerned about potential data corruption.

As an interesting aside, checksums on the journal are a somewhat problematic feature; as seen in this discussion from 2008, it is not at all clear what the best response should be when journal checksums fail to match. The journal checksum may not be information that the system can reasonably act upon; indeed, as in this case, it may create problems of its own.

Eric's patch appears to fix the problem; corrupted journals that were easily observed before its application do not happen afterward. There will naturally be a period of review and testing before this change is merged into the mainline — nobody wants to create a new problem through undue haste — but kernel releases with a version of the fix (it has already been revised once) should be available to users in short order. But most users will not really care, since they were not affected by the problem in the first place. They may care more about the plans to improve the filesystem test suites so that regressions of this nature can be more easily caught in the future.

Analysis

In retrospect, the media coverage of this bug was clearly out of proportion to that bug's impact. One might attribute that to a desire for sensational stories to drive traffic, and that may well be part of what was going on. But there are a couple of other factors that are worth keeping in mind before jumping to that judgment:

  • Many media outlets employ editors and writers who, almost beyond belief, are not trained in kernel programming. That makes it very hard for them to understand what is really going on behind a linux-kernel discussion even if they read that discussion rather than basing a story on a single message received in a tip. They will see a subject like "Apparent serious progressive ext4 data corruption," along with messages from prominent developers seemingly confirming the problem, and that is what they have to go with. It is hard to blame them for seeing a major story in this thread.

  • Even those who understand linux-kernel discussions (LWN, in its arrogance, places itself in this category) can be faced with an urgent choice. If there were a data corruption bug in recent kernels, then we would be beyond remiss to fail to warn our readers, many of whom run the kernels in question. There comes a point where, in the absence of better information, there is no alternative to putting something out there.

The ext4 developers certainly cannot be faulted for the way this story went. They did what conscientious developers do: they dropped everything to focus on what appeared to be a serious regression affecting their users. They might have avoided some of the splash by taking the discussion private and not saying anything until they were certain of having found the real problem, but that is not the way our community works. It is hard to imagine that pushing development discussions out of the public view is going to make things better in the long run.

Thus, one might conclude that we are simply going to see an occasional episode like this, where a bug report takes on a life of its own and is widely distributed before its impact is truly understood. Early reports of software problems, arguably, should be treated like early software: potentially interesting, but likely to be in need of serious review and debugging. That's simply the world we live in.

A more serious concern may apply to the addition of features to the ext4 filesystem. Ext4 is viewed as the stable, production filesystem in the Linux kernel, the one we're supposed to use while waiting for Btrfs to mature. One might well question the addition of new features to this filesystem, especially features that prove to be rarely used or that don't necessarily play well with existing features. And, sure enough, Linux filesystem developers have raised just this kind of worry in the past. In the end, though, the evolution of ext4 is subject to the same forces as the rest of the kernel; it will go in the directions that its developers drive it. There is interest in enhancing ext4, so new features will find their way in.

Before getting too worried about this prospect, though, it is worth thinking about the history of ext4. This filesystem is heavily used with all kinds of workloads; any problems lurking within will certainly emerge to bite somebody. But problems that have affected real users have been exceedingly rare and, even in this case, the number of affected users appears to be countable without running out of fingers. Ext4, in other words, has a long and impressive record of stability, and its developers are determined to keep it that way; this bug can be viewed as the sort of exception that proves the rule. One should never underestimate the value of good backups, but, with ext4, the chances of having to actually use those backups remain quite small.

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds