Brief items
The current development kernel is 3.7-rc5, released on November 11. "This is quite a small -rc, I'm happy to say. -rc4 was already fairly calm, and -rc5 has fewer commits still. And more importantly, apart from one revert, and a pinctrl driver update, it's not just a fairly small number of commits, they really are mostly one-liners."
Stable updates: no stable updates have been released in the last
week. 3.2.34 is in the review process as
of this writing; it can be expected on or after November 16.
I for one do mourn POSIX, and standardization in general. I think
it's very sad that a lot of stuff these days is moving forward
without going through a rigorous standardization. We had this
little period known affectionately as the "Unix Wars" in the
1980s/90s and we're well on our way to a messy repeat in the Linux
space.
— Jon Masters
You're missing something; that is one of the greatest powers of
open source. The many eyes (and minds) effect. Someone out there
probably has a solution to whatever problem, the trick is to find
that person.
— Russell King
Unfortunately there is no EKERNELSCREWEDUP, so we usually use EINVAL.
— Andrew Morton
Back in early 2011, we
looked at
changes to the way Red Hat distributed its kernel changes. Instead of separate patches, it switched to distributing a tarball of the source tree—a move which was met with a fair amount of criticism. The Ksplice team at Oracle has just
announced the availability of a Git tree that breaks the changes up into individual patches again. "
The Ksplice team is happy to announce the public availability of one of our git repositories, RedPatch. RedPatch contains the source for all of the changes Red Hat makes to their kernel, one commit per fix and we've published it on oss.oracle.com/git. With RedPatch, you can access the broken-out patches using git, browse them online via gitweb, and freely redistribute the source under the terms of the GPL."
(Thanks to Dmitrijs Ledkovs.)
Jon Masters has put together
a
summary of how atomic operations work on the ARM architecture for those
who are not afraid of the grungy details. "
To provide for atomic
access to a given memory location, ARM processors implement a reservation
engine model. A given memory location is first loaded using a special 'load
exclusive' instruction that has the side-effect of setting up a reservation
against that given address in the CPU-local reservation engine. When the
modified value is later written back into memory, using the
corresponding 'store exclusive' processor instruction, the reservation
engine verifies that it has an outstanding reservation against that given
address, and furthermore confirms that no external agents have interfered
with the memory commit. A register returns success or failure."
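As an illustration, here is a minimal sketch of an atomic add built on that pairing, written in the style of the kernel's arch/arm atomic helpers but simplified for presentation (a plain int pointer, GCC inline assembly, and an ARMv6 or newer processor are assumed):

    static inline void atomic_add(int i, int *counter)
    {
        int result;
        unsigned long tmp;

        __asm__ __volatile__(
        "1:     ldrex   %0, [%3]\n"      /* load value, set up a reservation  */
        "       add     %0, %0, %4\n"    /* modify the loaded value           */
        "       strex   %1, %0, [%3]\n"  /* try the store; %1 is 0 on success */
        "       teq     %1, #0\n"        /* did another agent interfere?      */
        "       bne     1b"              /* if so, retry from the load        */
        : "=&r" (result), "=&r" (tmp), "+Qo" (*counter)
        : "r" (counter), "Ir" (i)
        : "cc");
    }

If the store-exclusive fails because some other agent touched the location after the reservation was taken, the loop simply retries the whole load-modify-store sequence.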
Linux-next maintainer Stephen Rothwell has announced that the
November 15 linux-next release will be the last until
November 26.
Herton Ronaldo Krzesinski has announced that Canonical intends to maintain
the 3.5.x stable kernel, which is shipped in the Ubuntu 12.10 release.
This kernel will be supported for as long as 12.10 is, currently planned to be through the end of March 2014.
Kernel development news
By Jonathan Corbet
November 14, 2012
The kernel's behavior on non-uniform memory access (NUMA) systems is, by
most accounts, suboptimal; processes tend to get separated from their
memory, leading to lots of cross-node traffic and poor performance. Until
now, the work to improve this situation has been a story of two competing
patch sets; it recently
appeared that one
of them may be set to be merged as the result of decisions made outside of the
community's view. But nothing in memory management is ever simple, so it
should be unsurprising that the NUMA scheduling discussion has become more
complicated.
On November 6, memory management hacker Mel Gorman, who had not contributed
code of his own toward NUMA scheduling so far, posted a new patch series
called "Foundation for automatic NUMA balancing," or
"balancenuma" for short. He pointed out that there were objections to both
of the existing approaches to NUMA scheduling and that it was proving hard
to merge the best from each. So his objective was to add enough
infrastructure to the memory management subsystem to make it easy to
experiment with different NUMA placement policies. He also implemented a
placeholder policy of his own:
The actual policy it implements is a very stupid greedy policy
called "Migrate On Reference Of pte_numa Node (MORON)". While
stupid, it can be faster than the vanilla kernel and the
expectation is that any clever policy should be able to beat MORON.
In short, the MORON policy works by instantly migrating pages whenever a
cross-node reference is detected using the NUMA hinting mechanism. Mel's
second version, posted one week later,
fixes a number of problems, adds the "home node" concept (that tries to
keep processes and their memory on a single "home" NUMA node), and adds
some statistics gathering to implement a "CPU follows memory" policy that
can move a process to a new home node if it appears that better memory locality
would result.
Andrea Arcangeli, author of the AutoNUMA approach, said that balancenuma "looks OK" and that
AutoNUMA could be built on top of it. Ingo Molnar, instead, was less
accepting, saying "I've picked up a
number of cleanups from your series and propagated them into tip:numa/core
tree." He later added a request
that Mel rebase his work on top of the numa/core tree. He clearly did not
see the patch set as a "foundation" on which to build. A new numa/core
patch set was posted on November 13.
Peter Zijlstra, meanwhile, has posted an "enhanced NUMA scheduling with adaptive
affinity" patch set. This one does away with the "home node" concept
altogether; instead, it looks at memory access patterns to determine where
a process's memory lives and who that memory might be shared with. Based
on that information, the CPU affinity mechanism is used to move processes
to the appropriate nodes. Peter says:
Note that this adaptive NUMA affinity mechanism integrated into the
scheduler is essentially free of heuristics - only the access
patterns determine which tasks are related and grouped. As a result
this adaptive affinity code is able to move both threads and
processes close(r) to each other if they are related - and let them
spread if they are not.
This patch set has not gotten a lot of review comments, and it does not
appear to have been folded into the numa/core series as of this writing.
What will happen in 3.8?
The numa/core approach remains in linux-next, which is meant to be the final staging area for code intended to be merged. And, indeed,
Ingo has reiterated that he plans to merge
this code for the 3.8 cycle, saying "numa/core sums up the consensus
so far." The use of that language might rightly raise some
eyebrows; when there are between two and four competing patch sets
(depending on how one counts) aimed at the same
problem, the term "consensus" does not usually come to mind. And, indeed,
it seems that this consensus does not yet exist.
Andrew Morton has been overtly grumpy; the existence of numa/core in
linux-next has made the management of his tree (which is based on
linux-next) difficult — his tree needs to be ready for the 3.8 merge window
where, he thinks, numa/core should not be
under consideration:
And yes, I'm assuming you're not targeting 3.8. Given the history
behind this and the number of people who are looking at it, that's
too hasty... And I must say that I deeply regret not digging my
heels in when this went into -next all those months ago. It has
caused a ton of trouble for me and for a lot of other people.
Hugh Dickins, a developer who is not normally associated with this sort of
discussion, chimed in as well:
People are still reviewing and comparing competing solutions.
Maybe this latest will prove to be closest to the right answer,
maybe it will not. It's, what, about two days old right now?
If we had wanted to push in a good solution a little prematurely,
we would surely have chosen Andrea's AutoNUMA months ago, despite
efforts to block it; and maybe we shall still want to go that way.
Please, forget about v3.8, cut this branch out of linux-next, and
seek consensus around getting it right for v3.9.
Rik van Riel agreed, saying "Having
unreviewed (some of it NAKed) code sitting in tip.git and you trying to
force it upstream is not the right way to go." He also suggested
that, if anything should be considered for merging in 3.8, it would be
Mel's foundation patches.
And that is where the discussion stands as of this writing. There is a lot
of uncertainty about what might happen with NUMA scheduling in 3.8, meaning
that, most likely, nothing will happen at all. It is highly unlikely that
Linus would merge the numa/core set in the face of the above complaints;
he would be far more likely to sit back and tell the developers involved to
work out something they can all agree with. So this is a discussion that
might go on for a while yet.
Making changes to the memory management subsystem is a famously hard thing
to do, especially when the changes are as large as those being considered
here. But there is another factor that is complicating this particular
situation. As the term "NUMA scheduling" suggests, this is not just a
memory management problem. The path to improved NUMA performance will
require coordinated changes to — and greater integration between — the
memory management subsystem and the CPU scheduler. It's telling that the
developers on one side of this divide are primarily associated with
scheduler development, while those on the other side are mostly memory
management folks. Each camp is, in a sense, invading the other's turf in
an attempt to create a comprehensive solution to the problem; it is not
surprising that some disagreements have emerged.
Also implicit in this situation is that Linus is unlikely to attempt to
resolve the disagreement by decree. There are too many developers and too
many interrelated core subsystems involved. So some sort of rough
consensus will have to be found. Your editor's explicitly unreliable
prediction is that little NUMA-related work will be merged in the 3.8
development cycle. Under pressure from several directions, the developers
involved will figure out how to resolve their biggest differences in the
next few months. The resulting code will likely be at least partially
merged for 3.9 — later than many would wish, but the end result is likely
to be better than would be seen with a patch set rushed into 3.8.
By Jonathan Corbet
November 14, 2012
One of the nice features of virtual memory is that applications do not have to
concern themselves with how much memory is actually available in the
system. One need not try to get too much work done to realize that some
applications (or their developers) have taken that notion truly to heart.
But it has often been suggested that the system as a whole would work
better if interested applications could be informed when memory is tight.
Those applications could react to that news by reducing their memory
requirements, hopefully heading off thrashing or out-of-memory situations.
The latest proposal along those lines is a new system call named
vmpressure_fd(); it is unlikely to be merged in its current form,
but it still merits a look.
The idea behind Anton Vorontsov's vmpressure_fd()
patch set is to create a mechanism by which
the kernel can inform user space when the system is under memory pressure.
An application using this call would start by filling out a
vmpressure_config structure:
    #include <linux/vmpressure.h>

    struct vmpressure_config {
        __u32 size;
        __u32 threshold;
    };
The size field should hold the size of the structure; it is there
as a sort of versioning mechanism should more fields be added to the
structure in the future. The threshold field indicates the
minimum level of
notification the application is interested in; the available levels are:
- VMPRESSURE_LOW: The system is out of free memory and is having to reclaim pages to satisfy new allocations. There is no particular trouble in performing that reclamation, though, so the memory pressure, while non-zero, is low.
- VMPRESSURE_MEDIUM: A medium level of memory pressure is being experienced — enough, perhaps, to cause some swapping to occur.
- VMPRESSURE_OOM: Memory pressure is at desperate levels, and the system may be about to fall prey to the depredations of the out-of-memory killer.
An application might choose to do nothing at low levels of memory pressure,
clean up some low-value caches at the medium level, and clean up everything
possible at the highest level of pressure. In this case, it would probably
set threshold to VMPRESSURE_MEDIUM, since notifications
at the VMPRESSURE_LOW level are not actionable.
Signing up for notifications is a simple matter:
    int vmpressure_fd(struct vmpressure_config *config);
The return value is a file descriptor that can be read to obtain pressure
events in this format:
    struct vmpressure_event {
        __u32 pressure;
    };
The current interface only supports blocking reads, so a read() on
the returned file descriptor will not return until a pressure notification has
been generated. Applications can use poll() to determine whether
a notification is available; the current patch does not support
asynchronous notification via the SIGIO signal.
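Putting those pieces together, a minimal consumer of the proposed interface might look like the sketch below. It assumes a C library wrapper for the new system call and the constants and structures from the proposed <linux/vmpressure.h> described above; the reactions to each pressure level are, of course, placeholders:

    #include <linux/vmpressure.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct vmpressure_config config = {
            .size      = sizeof(config),
            .threshold = VMPRESSURE_MEDIUM,   /* skip "low" notifications */
        };
        struct vmpressure_event event;
        int fd = vmpressure_fd(&config);

        if (fd < 0)
            return 1;
        for (;;) {
            /* Blocks until the kernel generates a pressure notification */
            if (read(fd, &event, sizeof(event)) != sizeof(event))
                break;
            if (event.pressure == VMPRESSURE_OOM)
                printf("severe pressure: drop everything droppable\n");
            else
                printf("medium pressure: trim low-value caches\n");
        }
        close(fd);
        return 0;
    }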
Internally, the virtual memory subsystem has no simple concept of memory
pressure, so the patch has to add one. To that end, the "reclaimer
inefficiency index" is calculated by looking at the number of pages
examined by the reclaimer and how many of those pages could not be
reclaimed. The need to look at large numbers of pages to find reclaim
candidates indicates that reclaimable pages are getting hard to find — that
the system is, in other words, under memory pressure. The index is simply
the ratio of reclamation failures to the number of pages examined,
expressed as a percentage.
This percentage is calculated over a "window" of pages examined; by
default, it is generated each time the reclaimer looks at 256 pages. This
value can be changed by tweaking the new vmevent_window sysctl
knob. There are also controls to set the levels at which the various
notifications occur: vmevent_level_medium (default 60) and
vmevent_level_oom (default 99); the "low" level is wired at zero,
so it will trigger anytime that the system is actively looking for pages to
reclaim.
An additional mechanism exists to detect the out-of-memory case, since it
can be hard to distinguish it using just the reclaimer inefficiency index.
The reclaim code includes the concept of a "priority" value that controls how
aggressively it tries to reclaim pages; that value starts at 12 and falls
over time as attempts to locate enough pages fail. If the priority falls
to four (by default; it can be set with the
vmevent_level_oom_priority knob), the system is deemed to be
heading into an out-of-memory state and the notification is sent.
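As a rough sketch (illustrative only, not the actual patch code), the index calculation, the threshold knobs, and the priority check described above fit together something like this, with the VMPRESSURE_* constants coming from the proposed header:

    /* Default notification thresholds described above (sysctl-tunable) */
    #define VMEVENT_LEVEL_MEDIUM         60  /* vmevent_level_medium        */
    #define VMEVENT_LEVEL_OOM            99  /* vmevent_level_oom           */
    #define VMEVENT_LEVEL_OOM_PRIORITY    4  /* vmevent_level_oom_priority  */

    static int pressure_level(unsigned long scanned, unsigned long reclaimed,
                              int priority)
    {
        /* Reclaimer inefficiency index over one window (256 pages scanned
           by default): failures as a percentage of the pages examined. */
        unsigned long index = 100 * (scanned - reclaimed) / scanned;

        /* The reclaim priority falling to 4 signals a looming OOM state
           even when the index alone does not show it. */
        if (priority <= VMEVENT_LEVEL_OOM_PRIORITY || index >= VMEVENT_LEVEL_OOM)
            return VMPRESSURE_OOM;
        if (index >= VMEVENT_LEVEL_MEDIUM)
            return VMPRESSURE_MEDIUM;
        return VMPRESSURE_LOW;   /* the "low" level is wired at zero */
    }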
Some reviewers questioned the need for a new system call. We already have
a system call — eventfd() — that exists to create file descriptors
for notifications from the kernel. Actually using eventfd() tends
to involve an interesting dance where the application gets a file
descriptor from eventfd(), opens a sysfs file, and writes the file
descriptor number into
the sysfs file to connect it to a specific source of events. But it is
an established pattern that might be best maintained here. Another
reviewer suggested using the perf events
subsystem, but Anton believes, not without
reason, that perf brings a lot of complexity to something that should be
relatively simple.
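For reference, the memcg threshold interface documented in Documentation/cgroups/memory.txt (and mentioned below) is one existing example of that dance; in that case the control files live in the cgroup filesystem rather than sysfs. A sketch, with an illustrative cgroup path and threshold and with error handling omitted, might look like:

    #include <sys/eventfd.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);
        int usage_fd = open("/sys/fs/cgroup/memory/mygroup/memory.usage_in_bytes",
                            O_RDONLY);
        int control_fd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
                              O_WRONLY);
        char buf[64];
        uint64_t count;

        /* Connect the eventfd to a 64MB usage threshold on this group */
        snprintf(buf, sizeof(buf), "%d %d %llu", efd, usage_fd,
                 64ULL * 1024 * 1024);
        write(control_fd, buf, strlen(buf));

        /* A read on the eventfd then blocks until the threshold is crossed */
        read(efd, &count, sizeof(count));
        printf("memory usage threshold crossed\n");
        return 0;
    }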
The other complaint has to do with the integration (or lack thereof) with
the "memcg" control-group-based memory usage controller. Memcg already has
a notification
mechanism (described in Documentation/cgroups/memory.txt)
that can inform a process when a control group is running out of
memory; it might make sense to use the same mechanism for this purpose.
Anton responded that the memcg mechanism
does not provide the same information, does not account for all memory
use, and requires the use of control groups — not always a popular
kernel feature. Still, even if vmpressure_fd() is merged as a
separate mechanism, it will at least have to be extended to work at the
control group level as well. The code shows that this integration has been
thought about, but it has not yet been implemented.
Given these concerns, it seems unlikely that the current patch set will
find its way into the mainline. But there is a clear desire for this kind
of functionality in all kinds of use cases from very large systems to very
small ones (Anton's patches were posted from a linaro.org address). So,
one way or another, a kernel in the near future will probably have the
ability to inform processes that it is experiencing some memory pressure.
The next challenge will then be getting applications to use those
notifications and reduce that pressure.
By Jonathan Corbet
November 13, 2012
As the standing-room-only crowd at Thomas Gleixner's LinuxCon Europe 2012
talk showed, there is still a lot of interest in the progress of the
realtime preemption patch set. Your editor attended with the main purpose
of heckling Thomas, thinking that our recent
Realtime Linux
Workshop coverage would include most of the information to be heard in
Barcelona. As it turns out, though, there were some new things to be
learned, along with some concerns about the possible return of an old
problem in a new form.
At the moment, realtime development is concentrated in three areas. The
first is the ongoing
work to mitigate problems with software interrupts; that has been covered
here before and has not changed much since then.
On the memory management front, the SLUB allocator is now the default for
realtime kernels. A few years ago, SLUB was considered hopeless for the
realtime
environment, but it has improved considerably since then. It now works
well when allocating objects from the caches. Anytime it becomes necessary
to drop into the page allocator, though, behavior will not be
deterministic; there is little to be done about that.
Finally, the latest realtime patches include a new option called
PREEMPT_LAZY. Thomas has been increasingly irritated by the throughput
penalty experienced by realtime users; PREEMPT_LAZY is an attempt to
improve that situation. It is an option that applies only to the scheduling of
SCHED_OTHER tasks (the non-realtime part of the workload); it works by
deferring context switches after a task is awakened, even if the
newly-awakened task has a higher priority. Doing so reduces determinism,
but SCHED_OTHER was never meant to
be deterministic in the first place. The benefit is a reduction in context
switches and a corresponding improvement in SCHED_OTHER throughput.
When SLUB and PREEMPT_LAZY are enabled, the realtime kernel shows a 60%
throughput increase with some workloads. Someday, Thomas said (not entirely
facetiously), realtime will
be faster than mainline, especially for workloads involving a lot of
networking. He is looking forward to the day when the realtime kernel
offers superior network performance; there should be some interesting
conversations with the networking developers when that happens.
The realtime kernel, he claimed in summary, is at production quality.
Looking at all the code that has been produced in the realtime effort,
Thomas concluded that, at this point, 95% of it has been upstreamed into
the mainline kernel. What is missing before the rest can go upstream is
"mainline sellable" solutions for memory management, high-resolution timers
(hrtimers), and software interrupts. The memory management work is the
most complex, and the patches are still "butt ugly." A significant amount
of cleanup work will be required before those patches can make it into the
mainline.
The hrtimer code, instead, just requires harassing the maintainer (a
certain Thomas Gleixner) to get it into shape; it's just a matter of
time. There needs to be a "less horrible" way to solve the software
interrupt problem; the search is on. The rest of the realtime tree,
including the infamous locking patches, is all nicely self-contained and
should not be a problem for upstreaming.
So what is coming in the future? The next big feature looks to be CPU
isolation. This is not strictly a realtime need, but it is useful for some
realtime users. CPU isolation gives one or more processors over to
user-space code, with no kernel overhead at all (as long as that code does
not make any system calls, naturally). It is useful for applications that
cannot wait even for a deterministic interrupt response; instead, they poll
a device so that they can respond even more quickly to events. There are
also high-performance users who pour vast amounts of money into expensive
hardware; these users are willing to expend great effort for a 1%
performance gain. For some kinds of workloads, the increased cache
locality offered by CPU isolation can give an improvement of 3-4%, so it is
unsurprising that these users are interested. A number of developers are
working on this problem; some sort of solution should be ready before too
long.
Also coming is the long-awaited deadline
scheduler. According to Thomas, this code
shows that, sometimes, it is possible to get useful work out of academic
institutions. The solution is close, he said, and could possibly even be
ready for the 3.8 merge window.
There is also interest in doing realtime work from a KVM guest
system. That will allow us to offload our realtime automation tasks into
the cloud. Thomas clearly thought that virtualized realtime was a bit of a
silly idea, but there is evidently a user community looking for this
capability.
Where are the contributors?
Given that things seem so close, Thomas asked himself why things were
taking so long; the realtime effort has been going for some eight years
now. The answer is that the problems are hard and that the
manpower to solve them has been lacking. Noting that few developers have
been contributing to the realtime tree, Thomas started to wonder if there
was a lack of interest in the concept overall. Perhaps all this work was
being done, but nobody was using it?
An opportunity to answer that question presented itself when kernel.org
went down for an extended period in 2011. It became necessary to provide
an alternative site for people wanting to grab the realtime patches; that,
in turn, was the perfect opportunity to obtain download statistics. It
turns out that most realtime patch set releases saw about 3,000 downloads
within the first three days. About 30% of those went to European
corporations, 25% to American corporations, 20% to Asian corporations, 5%
to academic institutions, and 20% "all over." 75% of the downloads, he
said, were done by users at identifiable corporations constituting a "who's
who" of the industry.
All told, there were 2,250 corporations that downloaded the realtime patch
set while this experiment was taking place. Of all those companies, less
than 5% reported back to the realtime developers in any way, be it a patch,
a bug report, or even an "it works, thanks" sort of message. A 5% response
rate may seem good; it should be enough to get the bugs fixed. But a
further look showed that 80% of the reports came from Red Hat and the Open
Source Automation Development Laboratory; add in Thomas's company
Linutronix, and the number goes up to 90%. "What," he asked the audience,
"are the rest of you doing?"
Thomas's conclusion is that something is wrong. Perhaps we are seeing a
return of the embedded nightmare in a new
form? As in those days,
he does see private reports from companies that are keeping all of their
work secret. Private reports are better than nothing, but he would really
like to see more participation in the community: more success reports, bug
reports, documentation contributions, and fixes. Even incorrect fixes are
a great thing; they give a lot of information about the problem and ease
the process of making a proper fix.
To conclude, Thomas noted that some people have complained that his
roadmap slides are insufficiently serious. In response, he said, he took a few days off to attend a marketing course; that has allowed him to produce a more suitable roadmap that looks like this:

[Thomas's roadmap slide]
Perhaps the best conclusion to be drawn from this roadmap is that Thomas is
unlikely to switch from development to marketing anytime in the near
future. That is almost certainly good news for Linux users — and probably for
marketing people as well.
[Your editor would like to thank the Linux Foundation for funding his
travel to Barcelona.]
Patches and updates
Miscellaneous
- Lucas De Marchi: kmod 11 (November 11, 2012)
Page editor: Jonathan Corbet