Brief items
The current development kernel is 3.7-rc3, released on October 28. Linus
notes that it's mostly a lot of small changes in a lot of places. But he has
found a new problem to be concerned about: "And talking about the shortlog:
christ people, some of you need to change your names. I'm used to there
being multiple 'David's and 'Peter's etc, but there are three different
Linus's in just this rc. People, people, I want to feel like the unique
snowflake I am, not like just another anonymous guy in a crowd."
Stable updates: 3.0.49, 3.4.16, and 3.6.4 all came out on October 28; they
were followed by 3.0.50, 3.2.33, 3.4.17, and 3.6.5 on October 31. All contain
another set of important fixes. Notably, in response to another reported
regression, 3.6.5 disables by default the hard and soft link security
restrictions that were added during the 3.6 merge window.
And the next technology journalist that asks you whether you want
fonts that small, I'll just hunt down and give an atomic wedgie.
—
Linus
Torvalds doesn't do blocking wedgies
And suddenly causing a complete cessation of vm scanning at a
particular magic threshold seems rather crude, compared to some
complex graduated thing which will also always do the wrong thing,
only more obscurely ;)
—
Andrew Morton
You will get this message once a day until you've dealt with these
bugs!
—
bugzilla@kernel.org failing to win
friends and influence developers
Greg Kroah-Hartman is looking for somebody to help him put stable kernels
together. "I'm looking for someone to help me out with the stable Linux
kernel release process. Right now I'm drowning in trees and patches, and
could use some one to help me sanity-check the releases I'm doing."
Kernel graphics maintainer Dave Airlie is rather unimpressed with the
Raspberry Pi driver release; it is not something that will ever be merged.
"Why is this bad? You cannot make any improvements to their GLES
implementation, you cannot add any new extensions, you can't fix any bugs,
you can't do anything with it. You can't write a Mesa/Gallium driver for it.
In other words you just can't."
Kernel development news
By Jonathan Corbet
October 31, 2012
Earlier this year, two different developers set out to create a solution to
the problem of performance (or the lack thereof) on non-uniform memory
access (NUMA) systems. The Linux kernel's scheduler will freely move
processes around to maximize CPU utilization on large systems;
unfortunately, on NUMA systems, that can lead to processes being separated
from their memory, reducing performance considerably. Two very different
solutions to the problem were posted, leaving no clear path toward a
single solution that could be merged into the mainline. Now, perhaps, that
single solution exists, but the way that solution came about raises some
questions.
The first approach was Peter Zijlstra's sched/numa patch set. It added a "lazy
migration" mechanism (implemented by Lee Schermerhorn) that uses soft page
faults to move useful pages to the NUMA node where they were actually being
used. On top of that, it implemented a new "home node" concept that keeps
the scheduler from moving processes between NUMA nodes whenever possible;
it also tries to make memory allocations happen on the allocating process's
home node. Finally, there was a pair of system calls allowing a process to
change its home node and to form groups of processes that should all run on
the same home node.
Andrea Arcangeli's AutoNUMA patch set,
instead, was more strongly focused on migrating pages to the nodes where
they are actually being used. To that end, it created a tracking mechanism
(again, using page faults) to figure out where page accesses were coming
from; there was a new kernel thread to perform this tracking. Whenever the
generated statistics revealed that too many pages were being accessed from
remote nodes, the kernel would consider either relocating the processes
performing those accesses or relocating the pages; either way, the goal was
to get both the processes and the pages on the same node.
To say that the two developers disagreed on the right solution is to
understate the case considerably. Peter claimed that AutoNUMA abused the
scheduler, added too much memory overhead, and slowed scheduling decisions
unacceptably. Andrea responded that sched/numa would not work well,
especially for larger jobs, without manual tweaking by developers and/or
system administrators. The conversation was rather less than polite at
times — until it went silent altogether. Peter last responded to the
AutoNUMA discussion at the end of June — this
example demonstrates the level of the discussion at that time — and the last sched/numa posting happened at the
end of July.
The silence ended on October 25 with Peter's posting of the numa/core patch set. The patch introduction
reads:
Here's a re-post of the NUMA scheduling and migration improvement
patches that we are working on. These include techniques from
AutoNUMA and the sched/numa tree and form a unified basis - it has
got all the bits that look good and mergeable....
These patches will continue their life in tip:numa/core and unless
there are major showstoppers they are intended for the v3.8 merge
window. We believe that they provide a solid basis for future work.
It is worth noting that the value of "we" is not well defined anywhere in
the patch set.
Numa/core brings in much of the sched/numa patch set, including the lazy
migration scheme, the memory policy changes, and the home node concept.
The core scheduler change tries to keep processes on their home node by
adding resistance to moving a process away from that node, and by trying
to push misplaced processes back to the home node during load balancing.
There is also a feature to wake sleeping processes on the home node
regardless of where they were running before, but it
is disabled because "we found this to be far too aggressive."
Missing from this patch set are the proposed numa_tbind() and
numa_mbind() system calls; it's not clear whether those are meant
to be added later.
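The "resistance" idea can be illustrated with a toy model. What follows is only a sketch under stated assumptions, not the numa/core code: the Task class, the should_migrate() function, and the HOME_NODE_RESISTANCE constant are all hypothetical names, and load is reduced to a single number per node.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical tuning constant: the extra load imbalance required before a
# task is pulled away from its home node.
HOME_NODE_RESISTANCE = 2


@dataclass
class Task:
    name: str
    home_node: int
    current_node: int


def should_migrate(task: Task, dest_node: int, load: Dict[int, int]) -> bool:
    """Decide whether a toy load balancer moves `task` to `dest_node`."""
    if dest_node == task.current_node:
        return False
    # Positive when the task's current node is busier than the destination.
    imbalance = load[task.current_node] - load[dest_node]
    if dest_node == task.home_node:
        # Misplaced task: send it home unless home is the busier node.
        return imbalance >= 0
    if task.current_node == task.home_node:
        # Leaving home requires a larger imbalance than an ordinary move.
        return imbalance > HOME_NODE_RESISTANCE
    return imbalance > 0
```

In this model a task sitting on its home node stays put through small imbalances, while a misplaced task moves home as soon as doing so does not make the imbalance worse.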
The patch set also includes some ideas from AutoNUMA. The page
structure gains a new last_nid field to record the ID of the NUMA
node last observed to access the page. That new field will cause
struct page to grow on 32-bit systems, which is never a
popular thing to do. It is expected, though, that most systems where
better NUMA scheduling really matters will be 64-bit.
Scanning of memory is still done:
pages are marked as being absent so that usage patterns can be observed
from the resulting soft faults. But the kernel thread to perform this
scanning no longer exists; it is, instead, done by each process in its own
context. The number of pages scanned is proportional to each process's run
time, so little effort is put into the scanning of pages belonging to
processes that rarely run. Scanning does not start until a given process
has accumulated at least one second of run time. It makes sense that there
is little value in optimizing the NUMA placement of short-lived processes;
in this case, that intuition was confirmed with an improvement in the
all-important kernel-compilation benchmark. Most of the memory overhead
added by the original AutoNUMA patches has been removed.
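The scanning policy described above can be sketched in a few lines. This is an illustration only: the one-second threshold comes from the description above, while PAGES_PER_MS and the function name are hypothetical and do not reflect the actual rate used by the patches.

```python
SCAN_START_NS = 1_000_000_000  # one second of accumulated run time
PAGES_PER_MS = 4               # hypothetical scan rate, for illustration only


def pages_to_scan(total_runtime_ns: int, runtime_since_last_scan_ns: int) -> int:
    """Return how many pages a process should mark for fault-based tracking.

    Short-lived processes (under one second of total run time) are never
    scanned; after that, the work is proportional to recent run time.
    """
    if total_runtime_ns < SCAN_START_NS:
        return 0
    elapsed_ms = runtime_since_last_scan_ns // 1_000_000
    return elapsed_ms * PAGES_PER_MS
```

A process that rarely runs accumulates little run time between scans, so it is asked to scan few pages; a compiler invocation that exits in under a second is never scanned at all.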
Thus far, there has been little in the way of reviews of this large patch
set, and no benchmark results posted. Things will have to pick up on that
front if a patch set of this size is going to be ready by the time the 3.8
merge window opens. The numa/core patches may improve NUMA scheduling, and
they may be the right basis to move forward with, but the development
community as a whole does not know that yet.
There is one other thing that jumps out at an attentive observer. These
patches credit Andrea's work with a set of Suggested-by and
Based-on-idea-by tags, but none of them are signed off by Andrea.
It would appear that, while some of his ideas have found their way into
this patch set, his code has not. But, despite the fact that he did not
write this code, Andrea has been conspicuously absent from the review
discussion.
In the absence of any further information, it is hard not to
conclude that Andrea has removed himself from this particular project.
Certainly Red Hat cannot be faulted if it is unable to feel entirely
comfortable when some of its
highest-profile engineers are fighting among themselves in a public forum.
So it is not hard to imagine that the developers involved were given clear
instructions to resolve the situation. If that were the case, we would have a
solution that was arrived at as much by Red Hat management as by the wider
development community.
Such speculation (and it certainly is no more than that), of course,
says nothing about the quality of the current patch set. That will be
judged by the development community, presumably between now and when the
3.8 merge window opens. Assuming the patches pass this review, we should
have an improved NUMA scheduler and an end to an ongoing dispute. As the
number of NUMA (and NUMA-like) systems grows, that can only be a good thing.
By Jonathan Corbet
October 31, 2012
The read-copy-update (RCU) subsystem is one of the kernel's key scalability
mechanisms; it is usually invoked in situations where normal locking is far
too slow. RCU is known to be complex code, to the point that
lesser kernel developers will happily proclaim
that they do not understand it. That should not be taken to mean that RCU
cannot be made faster or more complex, though. Paul McKenney's
"callback-free CPUs" patch set is a case in point.
Much RCU processing has traditionally been done in software interrupt
(softirq) context, meaning that the actual processing is done at seemingly
random times during the execution of whatever process happens to have the
CPU at the time. Softirqs thus have the potential to add arbitrary delays
to the execution of any process, regardless of that process's priority. It
is not surprising that the realtime developers have been working on the softirq problem;
non-realtime developers, too, have been known to grumble about softirq
overhead. Depending on the load on the system, RCU processing can be a
significant part of the overall softirq workload. So improvements in RCU
processing can help eliminate unwanted latencies and jitter even if
software interrupt handling as a whole remains unchanged.
Paul recently described some work in that
direction on this page; as of the 3.6 kernel, much of the RCU grace
period handling has been moved to kernel threads. RCU works by replacing
a data structure with a modified version, retaining the old copy but hiding
it from view so that no new references to it will be created. The
RCU rules guarantee that any data structure made inaccessible in this way
before a
"grace period" passes will have no outstanding references after that
period; the determination of grace periods is thus a crucial step in the
cleanup and deletion of those old data structures. It turns out that
identifying grace periods in a scalable and efficient manner is not a
trivial task; see, for example, this
article for details.
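For readers who have not met RCU before, the replace-then-wait pattern can be modeled in user space. The ToyRCU class below is a hypothetical, single-threaded sketch: it tracks readers explicitly, which the kernel specifically avoids doing, so it shows only the ordering guarantee and not how grace periods are actually detected.

```python
class ToyRCU:
    """Single-threaded model of RCU's publish / grace period / reclaim cycle."""

    def __init__(self, data):
        self.current = data
        self.active_readers = set()  # IDs of readers inside a read-side section
        self.pending = []            # (readers still to wait for, callback)
        self._next_id = 0

    def read_lock(self):
        rid = self._next_id
        self._next_id += 1
        self.active_readers.add(rid)
        return rid, self.current

    def read_unlock(self, rid):
        self.active_readers.discard(rid)
        self._run_ready_callbacks()

    def update(self, new_data, callback):
        old = self.current
        self.current = new_data  # publish: new readers can only see new_data
        # The grace period ends once every pre-existing reader has finished.
        waiting = set(self.active_readers)
        self.pending.append((waiting, lambda: callback(old)))
        self._run_ready_callbacks()

    def _run_ready_callbacks(self):
        still_pending = []
        for waiting, cb in self.pending:
            waiting &= self.active_readers
            if waiting:
                still_pending.append((waiting, cb))
            else:
                cb()  # grace period over: safe to clean up the old copy
        self.pending = still_pending
```

The callback passed to update() plays the role of an RCU callback: it runs only after every reader that could have seen the old copy has dropped its reference.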
Moving grace period handling to kernel threads takes a certain amount of
RCU overhead out of the softirq path, reducing jitter and allowing that
handling to be assigned priorities like any other process. But, even with
grace period processing out of the way, RCU still has a fair amount of work
to do in softirq context. Near the top of the list is the calling of RCU
callbacks — the functions that actually perform cleanup work after a grace
period passes. With some workloads, the number of callbacks can get quite
large. Users concerned about jitter have expressed a desire to move as
much kernel processing out of the way as possible; RCU callback
processing represents a significant chunk of that work.
That is the motivation for Paul's callback-free
CPUs patch set. The idea is simple enough: rather than invoke RCU
callbacks in softirq context, the kernel can just shunt that work off to
yet another kernel thread. The implementation, of course, is just a bit
more involved than that.
The patch set adds a new rcu_nocbs= boot-time parameter allowing
the system administrator to specify a set of CPUs to run in the "no
callbacks" mode. It is not possible to do so with every CPU in the system;
at least one processor must remain in the traditional mode or grace period
processing will not function properly. In practical terms, that means that
CPU0 cannot be run in the no-callbacks mode and any attempt to hot-remove
the last traditional-RCU CPU will fail.
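The constraints on that parameter can be expressed as a small validation routine. The sketch below assumes the parameter takes a comma-separated CPU list with ranges (such as "1-3,5"); that format, along with the function names, is an assumption made for illustration and is not taken from the patch set.

```python
def parse_cpu_list(spec: str) -> set:
    """Parse a CPU list such as "1-3,5" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def validate_nocb_cpus(spec: str, nr_cpus: int) -> set:
    """Check a hypothetical rcu_nocbs= value against the rules described above."""
    nocb = parse_cpu_list(spec)
    if any(cpu >= nr_cpus for cpu in nocb):
        raise ValueError("CPU number out of range")
    if 0 in nocb:
        # CPU0 must stay in traditional mode so grace periods keep working.
        raise ValueError("CPU0 cannot run in no-callbacks mode")
    return nocb
```

Because CPU0 is always excluded, at least one traditional-RCU CPU remains no matter what the administrator asks for.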
When a CPU (call it CPUN) runs without RCU callbacks, there
will be a separate
rcuoN process charged with callback handling. When that
process
wakes up, it will grab the list of outstanding callbacks for its assigned
CPU, using some tricky atomic-exchange techniques to avoid the need for
explicit locking. The thread will wait for the grace period to expire,
then run through the callbacks; after that the cycle begins anew. Normally
the process wakes up when callbacks are added to an empty list, but a
separate boot parameter instructs the threads to poll occasionally for new
work instead. Polling has its costs, especially on systems where energy
efficiency and letting CPUs sleep are priorities, but it can improve RCU's
CPU efficiency, helping throughput.
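One pass of such a thread might look like the following model. It is a single-threaded sketch with hypothetical names; the tuple swap stands in for the atomic exchange, and the grace period wait is supplied by the caller rather than implemented.

```python
class NoCallbackCPU:
    """Toy model of one no-callbacks CPU and its offload thread."""

    def __init__(self):
        self.callbacks = []

    def call_rcu(self, callback) -> bool:
        """Queue a callback; return True if the offload thread needs a wakeup."""
        was_empty = not self.callbacks
        self.callbacks.append(callback)
        return was_empty

    def rcuo_pass(self, wait_for_grace_period) -> int:
        """Run one cycle of the offload thread; return the callbacks invoked."""
        # Stand-in for the atomic exchange: take the whole list in one step
        # and leave an empty one behind for new arrivals.
        batch, self.callbacks = self.callbacks, []
        if not batch:
            return 0
        wait_for_grace_period()
        for callback in batch:
            callback()
        return len(batch)
```

The return value of call_rcu() models the wakeup rule: only the callback that lands on an empty list needs to wake the thread, while a polling configuration would ignore that value and call rcuo_pass() periodically instead.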
Users who are so sensitive to jitter that they want to reconfigure RCU
callback processing may not be satisfied just by having that processing
move to a thread that competes with their workload. The good news for
those users is that, once callback processing lives in its own thread, it
can be assigned a priority that fits with the overall goals of the system.
Perhaps even better, the callback thread does not have to run on the CPU
whose callbacks it is handling; by playing with CPU affinities,
administrators can move that work to other CPUs, freeing the no-callback
CPUs to focus more exclusively on the user's workload.
No-callback CPUs are thus part of the larger effort toward fully-dedicated
CPUs that run nothing but the user's processes. The idea is that, on such
a CPU, the workload would be fully in charge and need never worry that the
kernel would get in the way when there is time-sensitive work to be done.
Solving that problem in a robust and maintainable manner is a rather larger
problem; it requires the NoHZ mechanism and
more. It has been recognized for some time that this problem will need to
be solved in smaller pieces; the no-callback CPUs patch is one of those
pieces.
This patch set is in its second iteration; comments this time around have
been scarce. Barring surprises, this feature could well be pushed into the
3.8 kernel. Most users will not care, but, for those who obsess about
latency and jitter, it should be a welcome addition.
By Jonathan Corbet
October 29, 2012
In just a few days, a linux-kernel mailing list report of ext4 filesystem
corruption turned into a widely-distributed news story; the quality of ext4
and its maintenance, it seemed, was in doubt. Once the dust settled, the
situation turned out to be rather less grave than some had thought; the bug
in question only threatened a very small group of ext4 users using
non-default mount options. As this is being written, a fix is in testing
and should be making its way toward the mainline and stable kernels
shortly. The bug was
obscure, but there is value in looking at how it came about and the ripples
it caused.
The timeline
On October 23, user "Nix" was trying to help track down an NFS lock
manager crash when he ran into a little problem: the crash kept corrupting his filesystem, making the
debugging task rather more difficult than it would otherwise have been. He
reported the problem to the linux-kernel mailing list; he
also posted a warning for other LWN
readers. The ext4 developers moved quickly to find the problem, coming up
with a hypothesis within a few hours of the
initial report. Unfortunately, the hypothesis turned out to be wrong.
Before that became clear, though, a number of news outlets had posted
articles on the problem. LWN was not the first to do so ("first" is not at
the top of our list of priorities), but, late on the 24th, we, too, posted
an item about the issue. It quickly became
clear, though, that the original hypothesis did not hold water, and that
further investigation was in order. That investigation, as it turns out,
took a few days to play out.
Eric Sandeen eventually tracked the problem down to this
commit, which found its way into the mainline during the 3.4 merge
window. That change was meant to be a cleanup, gathering the inode
allocation logic into a single function and removing some duplicated code.
The unintended result was to cause the inode bitmap to be modified outside of a
transaction, introducing unchecksummed data into the
journal. If the system crashed during that time, the next mount would
encounter checksum errors and refuse to play back the journal; the
filesystem was then seen as being corrupt.
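The failure mode is easy to reproduce in miniature. The sketch below is a loose analogy rather than a model of the ext4 journal: CRC32 from zlib is used as a stand-in checksum, and commit() and replay() are hypothetical helpers.

```python
import zlib


def commit(blocks):
    """Record a transaction, checksumming its blocks at commit time."""
    return {
        "blocks": [bytearray(b) for b in blocks],
        "checksum": zlib.crc32(b"".join(blocks)),
    }


def replay(txn):
    """Replay a transaction, refusing if the blocks no longer match."""
    actual = zlib.crc32(b"".join(bytes(b) for b in txn["blocks"]))
    if actual != txn["checksum"]:
        raise IOError("journal checksum mismatch; refusing to replay")
    return [bytes(b) for b in txn["blocks"]]


txn = commit([b"inode bitmap", b"inode table"])
replay(txn)                 # succeeds: nothing changed after the commit
txn["blocks"][0][0] ^= 0x01  # a change made outside of any transaction
# replay(txn) would now raise IOError
```

Without a checksum to compare against, the same replay would simply proceed, which is why users with journal checksums disabled never saw the problem.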
The interesting thing is that, on most systems, this problem will never
come about because, on those systems, the journal checksums do not actually
exist. Journal checksumming is an optional feature, not enabled by
default, and, evidently, not widely used. Nix had turned on the feature
somewhat inadvertently; most other users do not turn it on at all, even if
they are aware it exists. Anybody who has journal checksums
turned off will not be affected by this bug, so very few ext4 users needed
to be concerned about potential data corruption.
As an interesting aside, checksums on the journal are a somewhat
problematic feature; as seen in this discussion
from 2008, it is not at all clear what the best response should be when
journal checksums fail to match. The journal checksum may not be
information that the system can reasonably act upon; indeed, as in this
case, it may create problems of its own.
Eric's patch appears to fix the problem;
corrupted journals that were easily observed before its application do not
happen afterward. There will naturally be a period of review and testing
before this change is merged into the mainline — nobody wants to create a
new problem through undue haste — but kernel
releases with a version of the fix (it has already been revised once) should be available to users in short
order. But most
users will not really care, since they were not affected by the problem in
the first place. They may care more about the plans to improve the
filesystem test suites so that regressions of this nature can be more
easily caught in the future.
Analysis
In retrospect, the media coverage of this bug was clearly out of proportion
to that bug's impact. One might attribute that to a desire for sensational
stories to drive traffic, and that may well be part of what was going on.
But there are a couple of other factors that are worth keeping in mind
before jumping to that judgment:
- Many media outlets employ editors and writers who, almost beyond
belief, are not trained in kernel programming. That makes it very
hard for them to understand what is really going on behind a
linux-kernel discussion even if they read that discussion rather than
basing a story on a single message received in a tip. They will see a
subject like "Apparent serious progressive ext4 data corruption,"
along with messages from prominent developers seemingly confirming the
problem, and that is what they have to go with. It is hard to blame
them for seeing a major story in this thread.
- Even those who understand linux-kernel discussions (LWN, in its arrogance,
places itself in this category) can be faced with an urgent choice. If
there were a data corruption bug in recent kernels, then we would
be beyond remiss to fail to warn our readers, many of whom run the
kernels in question. There comes a point where, in the absence of
better information, there is no alternative to putting something out
there.
The ext4 developers certainly cannot be faulted for the way this story
went. They did what conscientious developers do: they dropped everything
to focus on what appeared to be a serious regression affecting their
users. They might have avoided some of the splash by taking the discussion
private and not saying anything until they were certain of having found the
real problem, but that is not the way our community works. It is hard to
imagine that pushing development discussions out of the public view is
going to make things better in the long run.
Thus, one might conclude that we are simply going to see an occasional
episode like this, where a bug report takes on a life of its own and is
widely distributed before its impact is truly understood. Early reports of
software problems, arguably, should be treated like early software:
potentially interesting, but likely to be in need of serious review and
debugging. That's simply the world we live in.
A more serious concern may apply to the addition of features to the ext4
filesystem. Ext4 is viewed as the stable, production filesystem in the
Linux kernel, the one we're supposed to use while waiting for Btrfs to
mature. One might well question the addition of new features to this
filesystem, especially features that prove to be rarely used or that don't
necessarily play well with existing features. And, sure enough, Linux
filesystem developers have raised just this
kind of worry in the past. In the end, though, the evolution of ext4
is subject
to the same forces as the rest of the kernel; it will go in the directions
that its developers drive it. There is interest in enhancing ext4, so
new features will find their way in.
Before getting too worried about this prospect, though, it is worth
thinking about the history of ext4. This filesystem is heavily used with
all kinds of workloads; any problems lurking within will certainly emerge
to bite somebody. But problems that have affected real users have been
exceedingly rare and, even in this case, the number of affected users
appears to be countable without running out of fingers. Ext4, in other
words, has a long and impressive record of stability, and its developers
are determined to keep it that way; this bug can be viewed as the sort of
exception that proves the rule. One should never underestimate the value
of good backups, but, with ext4, the chances of having to actually use
those backups remain quite small.
Patches and updates
Kernel trees
- Thomas Gleixner: 3.6.3-rt9. (October 29, 2012)
- Thomas Gleixner: 3.6.3-rt7. (October 27, 2012)
- Thomas Gleixner: 3.6.3-rt8. (October 27, 2012)
Page editor: Jonathan Corbet