Brief items
The current development kernel is 2.6.39-rc4,
released on April 18. According to
Linus:
So things have sadly not continued to calm down even further. We
had more commits in -rc4 than we had in -rc3, and I sincerely hope
that upward trend doesn't continue.
That said, so far the only thing that has really caused problems
this release cycle has been the block layer plugging changes, and
as of -rc4 the issues we had with MD should hopefully now be behind
us. So we're making progress on that front too.
The short-form changelog is in the announcement, or see the
full changelog for all the details.
Stable updates: 2.6.34.9 was
released on April 17, 2.6.32.37 and 2.6.33.10 were released on April 15 (and
quickly followed by 2.6.32.38 and 2.6.33.11 to fix a problem with the RDS
network protocol), and 2.6.38.3 was
released on April 14.
The 2.6.32.39, 2.6.33.12, and 2.6.38.4 updates are in the review process as
of this writing; they can be expected on or after April 21.
There really are only two acceptable models of development: "think
and analyze" or "years and years of testing on thousands of
machines". Those two really do work.
--
Linus Torvalds
Texas Instruments has
announced
the delivery of a mobile-grade, battery-optimized Wi-Fi solution to the
open source Linux community as part of the
OpenLink project. "OpenLink wireless connectivity drivers attach to open
source development platforms such as BeagleBoard, PandaBoard and other
boards. Whether working with Android, MeeGo or other Linux-based
distributions, developers can now access code natively as part of their
kernel builds to introduce the latest low-power wireless connectivity
solution into their products. Additionally, community support and
resources are available 24/7 via the active OpenLink community."
MIPS Technologies has announced the launch of its new developer community
developer.mips.com. "The
new site, which is live now, is specifically tailored to the needs of
software developers working with the Android(TM) platform, Linux operating
system and other applications for MIPS-Based(TM) hardware. All information
and resources on the site are openly accessible."
By Jonathan Corbet
April 20, 2011
The kernel has two different ways of dealing with systems where there are
large gaps in the physical memory address space - DISCONTIGMEM and
SPARSEMEM. Of those two, DISCONTIGMEM is the older; it has been
semi-deprecated for some time and appears to be on its (slow) way out. But
some architectures still use it. Recent changes (and the resulting
crashes) have shown that there are some interesting misunderstandings about
how DISCONTIGMEM is handled in the kernel.
The problem comes down to this: DISCONTIGMEM tracks separate ranges of
memory by putting each into its own virtual NUMA node. The result is that
a system running in this mode can appear to have multiple NUMA nodes, even
if NUMA support is not configured in. That apparently works well much of
the time, but it recently has been shown to cause crashes in the SLUB
allocator, which is not prepared for the appearance of multiple NUMA nodes
on a non-NUMA system.
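The failure mode can be illustrated with a toy model. The sketch below is
not kernel code; the names MAX_NUMNODES, ToyCache, and node_of_page are
hypothetical, and it assumes only what is described above: an allocator
sizes its per-node state for a single node when NUMA is not configured,
while DISCONTIGMEM still tags pages with non-zero node IDs.

```python
# Toy model (not kernel code): an allocator that sizes its per-node
# bookkeeping for a single node, as a non-NUMA build would.
MAX_NUMNODES = 1  # assumption: NUMA support is not configured in


class ToyCache:
    def __init__(self):
        # One list of pages per node the allocator believes exists.
        self.per_node = [[] for _ in range(MAX_NUMNODES)]

    def free_page(self, page, node_id):
        # Return a page to the list for the node it came from.
        self.per_node[node_id].append(page)


def node_of_page(pfn, ranges):
    """Map a page frame number to the (virtual) node owning its range."""
    for node_id, (start, end) in enumerate(ranges):
        if start <= pfn < end:
            return node_id
    raise ValueError("pfn %d is in a memory hole" % pfn)


# Two discontiguous ranges: DISCONTIGMEM puts each in its own node.
ranges = [(0, 1024), (65536, 66560)]
cache = ToyCache()
cache.free_page(10, node_of_page(10, ranges))  # node 0: fine
try:
    cache.free_page(65540, node_of_page(65540, ranges))  # node 1
    crashed = False
except IndexError:
    crashed = True  # in the kernel this would be an oops, not an exception
```

In Python the bad index raises an exception; in C the equivalent access
would simply read or write memory past the end of the array.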
There was a surprisingly acrimonious discussion on just whose fault this
misunderstanding is and how to fix it. Options include changing
DISCONTIGMEM to not "abuse" (in some people's view) the NUMA concept in
this way; that might be a long-term solution, but the bug exists now and,
as James Bottomley put it: "That has
to be fixed in -stable. I don't really think a DISCONTIGMEM re-engineering
effort would be the best thing for the -stable series." Another
option is to force NUMA support to be configured in when DISCONTIGMEM is
used; that could bloat the kernel on embedded systems and requires
acceptance of the strange concept that uniprocessor systems can be NUMA.
The kernel could be fixed to handle non-zero NUMA nodes at all times; that
could involve a significant code audit as the problems might not be limited
to the SLUB allocator. The SLUB allocator could be disallowed on non-NUMA
DISCONTIGMEM systems, but, once again, there may be issues elsewhere. Or
the process of escorting DISCONTIGMEM out of the kernel could be expedited
- though that would not be suitable for the stable series.
As of this writing the discussion continues; it's not clear what form the
real solution will take. The problem is subtle and there do not appear to
be any easy fixes at hand.
Kernel development news
By Jonathan Corbet
April 19, 2011
The kernel's ARM architecture support is one of the fastest-moving parts of
a project which, as a whole, is anything but slow. Recent
concerns about the state of the code in the
ARM tree threaten to slow things down considerably, though, with some
developers now worrying in public that support for new platforms could be
delayed indefinitely. The situation is probably not that grim, but some
changes will certainly need to be made to get ARM development back on
track.
Top-level ARM maintainer Russell King recently looked at the ARM patches in linux-next and
was not pleased with what he saw. About 75% of all the
architecture-specific changes in linux-next were for the ARM architecture, and
those changes add some 6,000 lines of new code. Some of this work is
certainly justified by the fact that the appearance of new ARM-based
processors and boards is a nearly daily event, but it is still problematic
in an environment where there have been calls for the ARM code to shrink.
So, Russell suggested: "Please take a moment to consider how Linus
will react to this at the next merge window."
As it turns out, relatively little consideration was required; Linus showed
up and told the ARM developers what to
expect:
Hint for anybody on the arm list: look at the dirstat that rmk
posted, and if your "arch/arm/{mach,plat}-xyzzy" shows up a lot,
it's quite possible that I won't be pulling your tree unless the
reason it shows up a lot is because it has a lot of code removed.
People need to realize that the endless amounts of new pointless
platform code is a problem, and since my only recourse is to say
"if you don't seem to try to make an effort to fix it, I won't pull
from you", that is what I'll eventually be doing.
Exactly when I reach that point, I don't know.
A while back, most of the ARM subplatform maintainers started managing
their own trees and sending pull requests directly to Linus. It was a move
that made some sense; the size and diversity of the ARM tree makes it hard
for a single top-level maintainer to manage everything. But it has also
led to a situation where there seems to be little overall control, and that
leads to a lot of duplicated code. As Arnd Bergmann put it:
Right now, every subarchitecture in arm implements a number of
drivers (irq, clocksource, gpio, pci, iommu, cpufreq, ...). These
drivers are frequently copies of other existing ones with slight
modifications or (worse) actually are written independently for the
same IP blocks. In some cases, they are copies of drivers for stuff
that is present in other architectures.
The obvious solution to the problem is to pull more of the code out of the
subplatforms, find the commonalities, and eliminate the duplications. It
is widely understood that a determined effort along these lines could
reduce the amount of code in the ARM tree considerably while simultaneously
making it more generally useful and more maintainable. Some work along
these lines has already begun; some examples include Thomas Gleixner's work to consolidate
interrupt chip drivers, Rafael Wysocki and
Kevin Hilman's work to unify some of the runtime power management code,
and Sascha Hauer's "sanitizing crazy clock data
files" patch.
Some of the ongoing work could benefit architectures beyond ARM as well.
It has been observed, for example, that most GPIO drivers tend to look a
lot alike. There are, after all, only so many ways that even the most
imaginative hardware designers can come up with to control a wire with a
maximum of two or three states. The kernel has an unbelievable number of
GPIO drivers; if most of them could be reduced to declarations of which
memory-mapped I/O bits need to be twiddled to read or change the state of
the line, quite a bit of code could go away.
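As a rough illustration of that idea, the sketch below models a single
generic GPIO driver that is parameterized by a declaration of register
offsets. It is a user-space toy, not a proposal for a kernel interface:
GpioBankDesc, GenericGpio, the register layout, and the offsets are all
hypothetical, and a dictionary stands in for memory-mapped I/O.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpioBankDesc:
    """Hypothetical declaration of one memory-mapped GPIO bank."""
    data_in: int    # register offset read to sample line state
    data_out: int   # register offset written to drive lines
    direction: int  # register offset; a set bit means "output"


class GenericGpio:
    """One driver body shared by every bank that fits the description."""

    def __init__(self, desc, regs):
        self.desc = desc
        self.regs = regs  # stand-in for MMIO: dict of offset -> value

    def _update(self, offset, line, value):
        # Read-modify-write of a single bit in one register.
        mask = 1 << line
        old = self.regs.get(offset, 0)
        self.regs[offset] = (old | mask) if value else (old & ~mask)

    def direction_output(self, line, value):
        # Set the level first so the line does not glitch when enabled.
        self._update(self.desc.data_out, line, value)
        self._update(self.desc.direction, line, 1)

    def get(self, line):
        return (self.regs.get(self.desc.data_in, 0) >> line) & 1


# Two "different" controllers become two declarations, not two drivers.
soc_a = GenericGpio(GpioBankDesc(0x00, 0x04, 0x08), regs={})
soc_b = GenericGpio(GpioBankDesc(0x10, 0x14, 0x18), regs={0x10: 0b100})
soc_a.direction_output(3, 1)
```

Hardware that needs separate set and clear registers, or inverted
direction bits, would need a richer description than this one.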
There is also talk of reorganizing the ARM tree so that most drivers no
longer live in subplatform-specific directories. Once all of the drivers
of a specific type can be found in the same place, it will be much easier
to find duplicates and abstract out common functionalities.
All of this work takes time, though, and the next merge window is due to
open in less than two months. Any work which is to be merged for 2.6.40
needs to be in a nearly-complete state by now; most of the work that
satisfies that criterion will be business as usual: adding new platforms,
boards, and drivers. Russell worries that
this work is now unmergeable:
Will we ever be able to put John's code in the kernel? Honestly, I
have no idea. What I do know is that unless we start doing
something to solve the problem we have today with the quantity of
code under arch/arm _and_ the constant churn of that code, we will
_never_ be able to add new platform support in any shape or form to
the kernel.
Russell has an occasional tendency toward drama that might cause readers to
discount the above, but he's not alone in these worries. Mark Brown is concerned that ARM development will come to a
halt for the next several months; he also has expressed doubts about the whole idea that the
ARM tree must shrink before it can be allowed to grow again:
What we're telling people to do is work on random improvements to
more or less tangentially related code. This doesn't seem
entirely reasonable and is going to be especially offputting for
new contributors (like the people trying to submit new platforms,
many of them will be new to mainline work) as it's a pretty big
jump to start working on less familiar code when you're still
trying to find your feet and worried about stepping on people's
toes or breaking things, not to mention justifying your time to
management.
If these fears hold true, we could be looking at a situation where the
kernel loses much of its momentum - both in support for new hardware and in
getting more contributions from vendors. The costs of such an outcome
could be quite high; it is not surprising that people are concerned.
In the real world, though, such an ugly course of events seems unlikely.
Nobody expects the ARM tree to be fixed by the 2.6.40 merge window; even
Linus, for all his strongly-expressed opinions, is not so unreasonable.
Indeed, he is currently working on a patch to
git to make ARM cleanup work not look so bad in the statistics.
What is needed in the near future is not a full solution; it's a clear
signal that the ARM development community is working toward that solution.
Some early cleanup work, some pushback against the worst offenses, and a
plan for following releases should be enough to defer the Wrath Of Linus
for another development cycle. As long as things continue to head in the
right direction thereafter, it should be possible to keep adding support
for new hardware.
Observers may be tempted to view this whole episode as a black mark for the
kernel development community. How can we run a professional development
project if this kind of uncertainty can be cast over an entire
architecture? What we are really seeing here, though, is an example of how
the community tries to think for the long term. Cramming more ARM code
into the kernel will make some current hardware work now, but, in the long
term, nobody will be happy if the kernel collapses under its own weight.
With luck, some pushback now will help to avoid much more significant
problems some years down the line. Those of us who plan to still be
working on (and using) Linux then will benefit from it.
By Jake Edge
April 20, 2011
There was a large Linaro presence at this year's Embedded
Linux Conference with speakers from the organization reporting on
its efforts to consolidate functionality from the various ARM
architecture trees. One of those talks was by Amit Kucheria,
technical lead for the power management working group (PMWG), who talked about
what the working group has been doing since it began.
That includes some work on tools like powertop, and the newly available
PowerDebug, as well as some consolidation within the kernel tree.
He also highlighted areas where Linaro plans to focus its efforts in the
future.
Kucheria started with a look at what Linaro is trying to accomplish, part
of which is to "take the good things in the BSP [board support
package] trees and get them upstream". In addition, consolidating
the kernel source, so that there is one kernel tree that can be used by all
of the Linaro partners, is high on the list. There is a fair amount of
architecture consolidation that is part of that, including things like
reducing the
"ten or twenty memcpy() functions" to one version
optimized for all of the ARM processors. All of that work should
result in patches that get sent upstream.
The PMWG has "existed for six to eight months now", Kucheria
said, and has been focused on consolidation and tools. There has been a
bit of kernel work, which includes ensuring that the clock tree is exported
in the right place in debugfs for five systems-on-chip (SoCs) that Linaro and its
sponsors/partners have targeted (Freescale i.MX51, TI OMAP 3 and 4, Samsung
Orion, and ST-Ericsson UX8500). In addition, work was done on cpufreq,
cpuidle, and CPU hotplug for some of them. Some of
that work is still in progress, but most of it has gone
(or is working its way) upstream, he said.
Beyond kernel work, the group has been working on tools, starting with
getting powertop to work with ARM CPUs and pushing that work upstream.
A new tool, PowerDebug, has been created to help look at the clock tree to
see "what clocks are on, which are active, and at what
frequency", Kucheria said. It also shows power regulators that have
registered with the regulator framework by pulling information from
sysfs. It shows which regulators
are on and what voltages are currently being used. Other SoCs or
architectures can use PowerDebug simply by exporting their clock tree into
debugfs.
PMWG has also been experimenting with thermal management and hotplug. In
particular, it has been looking at what policies make sense when the CPU
temperature gets too high. One possibility would be to hot-unplug a core
to reduce the amount of heat generated. There is some inherent latency in
plugging or unplugging a core, he said, which can range from 40-50ms in a
simple case
to several seconds if there are a lot of threads running.
There is a notification chain that causes the latency, so it's possible
that could be reduced by various means.
Complexity in power management
With a slide showing the complexity of Linux power management (shown at
right) today, Kucheria launched into a description of some of the problems
that OEMs are faced with when trying to tune products for good battery
life. In that diagram, he noted there are "six or seven different
knobs that you can twiddle" to adjust power usage. Those OEMs
simply don't have the resources to deal with that complexity; some kind of
simplification is required. In addition, the complexity is growing with
more and more SoCs along with different power management schemes
in the hardware.
In the "good old days" of five or six years ago, the OMAP 1 just used the
Linux driver model suspend hooks to change the clock frequency. The clock
framework was standard back then, but now there are 30 or 40 different
clock frameworks in the ARM tree. CPU frequency scaling (cpufreq) was
added after that, but it doesn't take into account the bus or coprocessor
frequencies. Later on, several different frameworks were added including
the regulator framework, cpuidle to control idle states, and power
management quality of
service (pm_qos).
The quality of service controls are important for devices that need to
bound the latency for coming out of idle states, for example for network
drivers that cannot tolerate more than 300ms of latency. The cpuidle
framework introduced some problems, though, Kucheria said, because it
was created by Intel, which concentrated on its own platforms. The C-states
(C0-C6) don't really exist for ARM processors and various vendors
interpreted them differently for particular SoCs. In addition, some have
added extra states (C7, C8).
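The interaction between latency bounds and idle states can be sketched
as follows. This is a simplified model, not the cpuidle or pm_qos code:
the state names, exit latencies, and power figures are invented for
illustration, and pick_idle_state is a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical idle states: deeper states save more power but take
# longer to leave. The numbers are illustrative, not from any SoC.
IdleState = namedtuple("IdleState", "name exit_latency_us power_mw")

STATES = [
    IdleState("C1", 2, 500),
    IdleState("C2", 50, 200),
    IdleState("C3", 400, 50),
]


def pick_idle_state(states, latency_limits_us):
    """Pick the lowest-power state that honors every latency request.

    latency_limits_us holds one bound per requester (a driver, say);
    the strictest bound wins. With no requests, any state is allowed.
    """
    limit = min(latency_limits_us) if latency_limits_us else float("inf")
    allowed = [s for s in states if s.exit_latency_us <= limit]
    if not allowed:
        return None  # caller must stay out of idle states entirely
    return min(allowed, key=lambda s: s.power_mw)


unconstrained = pick_idle_state(STATES, [])
with_nic = pick_idle_state(STATES, [1000, 100])  # one requester wants <= 100us
```

The sketch also hints at why inconsistent state definitions hurt: the
selection only makes sense if every platform reports comparable latency
and power numbers for the states it exposes.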
Later still in the evolution of Linux power management, hotplug support was
added, which can reduce the power consumption by unplugging CPU cores.
There are a number of outstanding issues there, though, including latency
and policy. Vendors have various "patches floating around",
but there isn't a consistent approach. Coming up with policies, perhaps
embodied in a hotplug governor, is something that needs to be done.
Runtime power management was the next component added in. PMWG would like
to use it to reduce the need for drivers to talk directly to the clocks;
instead, they would talk in a more general way to the runtime power
management framework. Lots of code that is scattered around in various
drivers can be centralized in bus drivers, which will make the device
drivers much more portable because they don't refer to specific clocks.
Vendors have started switching over to using the runtime power management
framework, but "it's a painful process" to change all of the drivers, he
said.
The latest piece of the power management puzzle is the addition of
Operating Performance Points (OPP) support, which was added in 2.6.38. OPP
is a way to describe frequency/voltage pairs that a particular SoC will
support for its various sub-modules. OPP is very CPU/SoC-specific, but can
also encapsulate the requirements for different buses and co-processors.
The cpufreq framework can make use of the information as it changes the
frequency characteristics of different parts of the hardware.
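A minimal sketch of what such a table might look like, and how a
frequency-scaling policy could consult it, follows. It does not use the
kernel's OPP interface; CPU_OPPS, opp_ceil, and scaling_order are
hypothetical names and the frequency/voltage values are made up. The
ordering rule in scaling_order reflects general voltage-scaling practice
rather than anything stated in the talk.

```python
import bisect

# Hypothetical OPP table for one device: (frequency in Hz, voltage in
# microvolts), sorted by frequency. Values are illustrative only.
CPU_OPPS = [
    (300_000_000, 1_012_500),
    (600_000_000, 1_200_000),
    (800_000_000, 1_325_000),
]


def opp_ceil(opps, wanted_hz):
    """Return the slowest OPP running at least at wanted_hz, or None."""
    freqs = [f for f, _ in opps]
    i = bisect.bisect_left(freqs, wanted_hz)
    return opps[i] if i < len(opps) else None


def scaling_order(old_hz, new_hz):
    """Order of operations for a frequency change.

    Going faster needs the higher voltage in place first; going
    slower can drop the voltage only after the clock has been lowered.
    """
    if new_hz > old_hz:
        return ["set_voltage", "set_frequency"]
    return ["set_frequency", "set_voltage"]
```

A bus or coprocessor would get its own table of the same shape, which is
how one description format can cover more than the CPU.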
As more dual-core and quad-core packages are being used, heat can be a
problem. The existing thermal management framework is not being used by ARM
vendors yet and there are a number of issues to be resolved. Linaro wants
to "figure it out once and for all", and that is one of its
focuses in the coming months. One of the questions is what should be done
when the system is overheating. Should it unplug one or more cores? Or
reduce the frequency of the CPU clock? One of the "crazy
things" PMWG has been thinking about is registering devices that can
reduce their frequency as "cooling devices" (since they will generate less
heat with a lower frequency).
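The "cooling device" idea can be sketched with a toy policy loop. Nothing
here is PMWG's design or the kernel's thermal framework: CoolingDevice,
thermal_step, the trip temperature, and the choice to try frequency
reduction before unplugging a core are all assumptions made for the sake
of the example.

```python
class CoolingDevice:
    """Hypothetical cooling device: higher state means more cooling."""

    def __init__(self, name, max_state):
        self.name = name
        self.max_state = max_state
        self.state = 0

    def set_state(self, state):
        # Clamp to the range the device supports.
        self.state = max(0, min(state, self.max_state))


def thermal_step(temp_c, trip_c, devices):
    """One pass of a simple step-wise policy (an assumption, not PMWG's).

    Above the trip point, deepen cooling by one step on the first
    device that has headroom; otherwise relax the last engaged one.
    Returns the name of the device that was adjusted, or None.
    """
    if temp_c > trip_c:
        for dev in devices:
            if dev.state < dev.max_state:
                dev.set_state(dev.state + 1)
                return dev.name
    else:
        for dev in reversed(devices):
            if dev.state > 0:
                dev.set_state(dev.state - 1)
                return dev.name
    return None


# Prefer lowering the CPU clock; unplug a core only as a last resort.
cpufreq = CoolingDevice("cpufreq", max_state=2)  # two lower frequencies
hotplug = CoolingDevice("hotplug", max_state=1)  # one core can be removed
devices = [cpufreq, hotplug]
actions = [thermal_step(95, 85, devices) for _ in range(4)]
```

Treating both mechanisms as cooling devices lets one policy order them,
which is one way to answer the "unplug or slow down?" question per
platform rather than in each driver.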
PMWG's plans
The existing thermal management code works for desktop Linux, Ubuntu in
particular, and also for Android, but there is still some experimenting
that needs to be done to come up with an ARM-wide solution. Another area
that PMWG will work on is adding scheduling domains for ARM so that you can
"tweak your scheduler policy" regarding how processes and
threads get spread around on multiple cores. Scheduling domains and
sched_mc tunables could eliminate the need
for hotplug in some cases, he said.
Rationalizing the names and abilities of the processor C-states is also
something that PMWG will be working on. Kucheria said that PMWG wants to
"start a conversation" with the relevant vendors and
developers to make that happen. PowerDebug enhancements are also on the
radar: "If you need stuff [in PowerDebug], let us know".
There is lots of other consolidation work that could be done, but there are
only enough developers to address the parts he described, at least in the
near term.
At the end of the talk, Kucheria put the Linux power management diagram
slide back up, noting that the complexity was "great for job
security". There is clearly plenty of work to do in the ARM tree in
the months ahead. Kucheria's talk just covered the work going on in the
power management group, but there are four other groups within Linaro
(kernel, toolchain, graphics, and multimedia) that are doing similar jobs
inside and outside of the kernel. One gets the sense that the companies
who founded Linaro were getting as tired of the chaotic ARM world as the
kernel developers (e.g. Linus Torvalds) are. So far, the organization has
made some strides, but there is a long way to go.
By Jonathan Corbet
April 19, 2011
Swapping, like page writeback, operates under some severe constraints. The
ability to write dirty pages to backing store is critical for memory
management; it is the only way those pages can be freed for other uses. So
swapping must work well in situations where the system has almost no memory
to spare. But writing pages to backing store can, itself, require memory.
This problem has been well solved (with mempools) for locally-attached
devices, but network-attached devices add some extra challenges which have
never been addressed in an entirely satisfactory way.
This is not a new problem, of course; LWN ran
an article about swapping over network block devices (NBD) almost
exactly six years ago. Various approaches were suggested then, but none
were merged; it remains to be seen whether the
latest attempt (posted by Mel Gorman based on a lot of work by Peter
Zijlstra) will be more successful.
The kernel's page allocator makes a point of only giving out its last pages
to processes which are thought to be working to make more memory free. In
particular, a process must have either the PF_MEMALLOC or
TIF_MEMDIE flag set; PF_MEMALLOC indicates that the
process is currently performing memory
compaction or direct reclaim, while TIF_MEMDIE means the
process has run afoul of the out-of-memory killer and is trying to exit.
This rule should serve to keep some memory around for times when it is
needed to make more memory free, but one aspect of this mechanism does not
work entirely well: its interaction with slab allocators.
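Before looking at the slab problem, the basic reserve rule can be
modeled in a few lines. This is a toy, not the kernel's page allocator:
ToyPageAllocator and its single watermark are hypothetical, and the flag
values are illustrative rather than the kernel's.

```python
PF_MEMALLOC = 0x1  # illustrative flag values, not the kernel's
TIF_MEMDIE = 0x2


class ToyPageAllocator:
    def __init__(self, free_pages, min_watermark):
        self.free_pages = free_pages
        self.min_watermark = min_watermark  # pages held in reserve

    def alloc_page(self, task_flags):
        """Return (ok, from_reserve) for one page request."""
        if self.free_pages == 0:
            return (False, False)
        into_reserve = self.free_pages <= self.min_watermark
        if into_reserve and not task_flags & (PF_MEMALLOC | TIF_MEMDIE):
            # Only tasks working to free memory may use the last pages.
            return (False, False)
        self.free_pages -= 1
        return (True, into_reserve)


allocator = ToyPageAllocator(free_pages=3, min_watermark=2)
first = allocator.alloc_page(0)            # above the watermark
second = allocator.alloc_page(0)           # would dip into the reserve
third = allocator.alloc_page(PF_MEMALLOC)  # reclaimer may use the reserve
```

The from_reserve result is the piece of information that, in this model,
a slab allocator would need to carry along with the page.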
The slab allocators grab whole pages and hand them out in smaller chunks.
If a process marked with PF_MEMALLOC or TIF_MEMDIE
requests an object from the slab allocator, that allocator can use a
reserved page to satisfy the request. The problem is that the remainder of
the page is then made available to any other process which may make a
request; it could, thus, be depleted by processes which are making the
memory situation worse, not better.
So one of the first things Mel's patch series does is to adapt a patch by
Peter that adds more awareness to the slab allocators. A new
boolean value (pfmemalloc) is added to struct page to
indicate that the corresponding page was allocated from the reserves; the
recipient of the page is then expected to treat it with due care. Both
slab and SLUB have been modified to recognize this flag and reserve the
rest of the page for suitably-marked processes. That change should help to
ensure that memory is available where it's needed, but at the cost of
possibly failing other memory allocations even though there are objects
available.
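The trade-off described above can be seen in a small model. ToySlab is a
hypothetical stand-in for one slab page; only the pfmemalloc name comes
from the patch series, and the real slab and SLUB changes are more
involved than this.

```python
class ToySlab:
    """One page carved into fixed-size objects (not kernel code)."""

    def __init__(self, objects_per_page, pfmemalloc):
        self.free_objects = objects_per_page
        # True if the backing page came from the emergency reserves.
        self.pfmemalloc = pfmemalloc

    def alloc_object(self, caller_may_use_reserves):
        if self.free_objects == 0:
            return False
        if self.pfmemalloc and not caller_may_use_reserves:
            # Objects are free, but they are being held back.
            return False
        self.free_objects -= 1
        return True


slab = ToySlab(objects_per_page=4, pfmemalloc=True)
reclaimer_ok = slab.alloc_object(caller_may_use_reserves=True)
ordinary_ok = slab.alloc_object(caller_may_use_reserves=False)
```

The second request fails even though three objects remain free, which is
exactly the cost mentioned above.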
The next step is to add a __GFP_MEMALLOC GFP flag to mark
allocation requests which can dip into the reserves. This flag separates
the marking of urgent allocation requests from the process state - a change
which will be useful later in the series, where there may be no convenient
process state available. It will be interesting to see how long it takes
for some developer to attempt to abuse this flag elsewhere in the kernel.
The big problem with network-based swap is that extra memory is required
for the network protocol processing. So, if network-based swap is to work
reliably,
the networking layer must be able to access the memory reserves. Quite a
bit of network processing is done in software interrupt handlers which run
independently of any given process. The __GFP_MEMALLOC flag
allows those handlers to access reserved memory, once a few other tweaks
have been added as well.
It is not desirable to allow any network operation to access the
reserves, though; BitTorrent and web browsers should not be allowed to
consume that memory when it is urgently needed elsewhere. A new function,
sk_set_memalloc(), is added to mark sockets which are involved
with memory reclaim. Allocations for those sockets will use the
__GFP_MEMALLOC flag, while all other sockets have to get by with
ordinary allocation priority. It is assumed that only sockets managed
within the kernel will be so marked; any socket which ends up in user space
should not be able to access the reserves. So swapping onto a FUSE
filesystem is still not something which can be expected to work.
There is one other problem, though: incoming packets do not have a special
"needed for memory reclaim" flag on them. So the networking layer must be
able to allocate memory to hold all incoming packets for at least as
long as it takes to identify the important ones. To that end, any network
allocation for incoming data is allowed to dip into the reserves if need
be. Once a packet has been identified and associated with a socket, that
socket's flags can be checked; if the packet was allocated from the
reserves and the destination socket is not marked as being used for
memory reclaim, the packet will be dropped immediately. That change should
allow important packets to get into the system without consuming too much
memory for unimportant traffic.
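The receive-side decision can be summarized in a short sketch. It models
only the logic described above; Sock, Packet, and deliver are
hypothetical names, the memalloc field stands in for whatever
sk_set_memalloc() records on a socket, and real packet demultiplexing is
far more complicated than a dictionary lookup.

```python
from dataclasses import dataclass


@dataclass
class Sock:
    name: str
    memalloc: bool = False  # stand-in for the mark set by sk_set_memalloc()


@dataclass
class Packet:
    dest: str
    pfmemalloc: bool  # buffer was allocated from the reserves


def deliver(packet, sockets):
    """Return the receiving socket, or None if the packet is dropped."""
    sock = sockets.get(packet.dest)
    if sock is None:
        return None
    if packet.pfmemalloc and not sock.memalloc:
        # Reserve memory was spent on unimportant traffic: drop at once.
        return None
    return sock


sockets = {"nbd": Sock("nbd", memalloc=True), "web": Sock("web")}
swap_reply = deliver(Packet("nbd", pfmemalloc=True), sockets)
web_under_pressure = deliver(Packet("web", pfmemalloc=True), sockets)
web_normal = deliver(Packet("web", pfmemalloc=False), sockets)
```

Ordinary traffic is only penalized when its buffer had to come from the
reserves; packets allocated under normal conditions are delivered as
usual.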
The result should be a system where it is safe to swap over a network block
device. At least, it should be safe if the low watermark - which controls
how much memory is reserved - is high enough. Systems which are swapping
over the net may be expected to make relatively heavy use of the reserves,
so administrators may want to raise the watermark (found in
/proc/sys/vm/min_free_kbytes) accordingly. The final patch in the
series keeps an eye on the reserves and starts throttling processes
performing direct reclaim if they get too low; the idea here is to ensure
that enough memory remains for a smaller number of reclaimers to actually
get something done. Adjusting the size of the reserves dynamically might
be the better solution in the long run, but that feature has been omitted
for now in the interest of keeping the patch series from getting too large.
Page editor: Jonathan Corbet