Brief items
The current development kernel is 2.6.34-rc4,
released on April 12, about
a week later than would have been expected. The delay was the result of a
nasty VM regression (see below). All told, some 500 fixes have been merged
since -rc3; see the announcement for the short-form changelog, or see
the
full changelog for all the details.
There have been no stable updates released in the last week.
Comments (none posted)
Hey, all my other theories made sense too.. They just didn't work.
But as Edison said: I didn't fail, I just found three other ways to
not fix your bug.
--
Linus Torvalds
To be honest I think 4K stack simply has to go. I tend to call
it "russian roulette" mode.
It was just a old workaround for a very old buggy VM that couldn't
free 8K pages and the VM is a lot better at that now. And the
general trend is to more complex code everywhere, so 4K stacks
become more and more hazardous.
It was a bad idea back then and is still a bad idea, getting
worse and worse with each MLOC being added to the kernel each year.
--
Andi Kleen
Comments (7 posted)
By Jonathan Corbet
April 14, 2010
When Google's Mike Waychison
addressed the 2009 Kernel
Summit, one of the goals he laid out was the merging of Google's idle
cycle injection code into the mainline. Idle cycle injection is the forced
idling of the CPU to avoid overheating; essentially, it is Google's way of
running processors to the very edge of their capability without going
past that edge and allowing the smoke to escape. This sort of power
management is
certainly not a Google-specific problem, so it makes sense to get the code
upstream. Salman Qazi's recently posted
kidled patch series shows the
current form of this work.
The core idea is simple: through some new control files under
/proc/sys/kernel/kidled, the system administrator can set, on a
per-CPU basis, the percentage of time that the CPU should be idle and an
interval over which that percentage is calculated. If the end of an
interval draws near and the CPU has not been naturally idle for the
requisite time, kidled will force the processor to go idle for a while.
Naturally enough, there are some complications. The first is that it would
be nice to avoid forcing idle cycles when important processes are running.
So kidled includes the notion of "eager cycle injection." By way of the
control group mechanism, processes can be marked as being "interactive."
When so-marked processes are not running, kidled will try to get its
forced idle cycles in early. When interactive processes are
running, instead, idle cycles will be forced only when strictly necessary.
In this way, "interactive" processes will not be impeded by idle cycle
injection except when there is no alternative.
The other twist has to do with the accounting of idle CPU time. The
injection of idle cycles takes CPU time away from somebody; the kidled code
allows the administrator to say who the victims should be. There is
another control group parameter which controls the "power capping
priority" of each process. When idle cycles are injected, kidled will mess
around in the scheduler's data structures, causing processes with lower
priorities to be charged for the idle time. That means that, when CPU
usage must be throttled, specific processes can be made to suffer more than
others.
As of this writing, there has been little public discussion of the
patches. The core concept is not controversial, but it will be interesting to
see how the scheduler-related parts of the series are received.
Comments (10 posted)
Kernel development news
By Jake Edge
April 14, 2010
Embedded Linux
Conference (ELC) organizer Tim Bird surveyed the embedded Linux landscape in
a talk he gave at the conference. He looked at new and proposed kernel
features that embedded developers might be interested in as well as issuing
a "call to arms" to those developers to get more involved with the
rest of the community. This talk is a regular feature at ELC to help the
embedded community stay on top of the "ripping" speed of
kernel development.
There have been four new kernels since last year's conference, and Bird
listed the interesting features for the embedded space in each of
2.6.30-33, as well as noting that LogFS had finally made it into the kernel
in 2.6.34, something that he was concerned might not ever happen. The
speed of kernel development is amazing, he said, and the great thing about
it is that "even while I am sleeping in my bed, people are pounding
away on it".
He pointed to a few "patches to watch" that may be coming in
new kernels, specifically the kbuild CROSS_COMPILE option, which
will make it easier to build for multiple architectures. He also noted
Arnd Bergmann's asm-generic
patches that are geared towards making it easier to add new architectures
to the kernel—without propagating the bugs and quirks from existing ones.
Boot speed
Bird then looked at different "technology areas" to point out interesting
features or work going on in those areas. Boot time is a "hot
topic" right now; it would have been in the past if the embedded
community was more involved in mainline kernel development. The Moblin five second boot effort really
kickstarted that work. He noted that he has a Sony (his employer) video
camera that boots Linux in 1.5 seconds; "I'm very proud of
that", he said.
Several new kernel features are available to help reduce boot time,
including asynchronous function
calls, which allow some parts of device initialization to run in
parallel. There is also scripts/bootgraph.pl to help visualize
where boot time is being spent.
Devtmpfs was also noted as a
way to decrease boot times, with some seeing a 0.6 second reduction on desktops. Bird
said that there needs to be some testing done on the embedded side to see
how much it can help there. He also listed two patches that speed up
symbol resolution for module loading by getting rid of the current linear
search. One switches to a binary search and the other uses a hash table.
For Bird's use cases, he always statically links in drivers, but has heard
that more embedded developers are going the loadable module route.
Greg Kroah-Hartman piped up that he needed one of those two patches for MeeGo, but that
the submitters had disappeared. There was general agreement that
contacting them and getting something upstream would be good.
Filesystems
Several different filesystems for embedded use cases were listed by Bird.
Squashfs has been out of the mainline for years, but was merged in 2.6.29,
and has since been improved by others in a "classic case"
demonstrating the advantages of mainline code. Ubifs is also in the
mainline and folks at Toshiba have been characterizing its performance,
which they reported on at the CE Linux Forum (CELF) Japan Jamboree. It has
"really slow mount times" in some cases, which CELF would like
to fund someone to fix.
LogFS is "way better optimized" for certain flash devices and
has fast mount times, he said. He noted that AXFS, the advanced
execute-in-place (XIP) filesystem, had kind of disappeared, so it didn't
appear to be on track for mainlining. He has been
playing with AXFS at Sony to try to further decrease boot time.
Bird also noted that the VFAT
patent avoidance patches had not made it into the mainline. It would
be useful for some embedded devices, he said. Most embedded developers
work around the patent by disabling VFAT and using 8.3 filenames, which is
somewhat unfortunate. Another thing he is keeping an eye on is VFS-based
union mounts, which would allow embedded developers to stop creating
"filesystems with weird links" between them as is currently common.
Power management and realtime
The runtime power management code has been merged, which will allow suspending
and resuming individual system components to reduce power consumption.
There is ongoing work on asynchronous suspend/resume, which Bird said he
didn't know very much about, but it's "gotta be really cool".
An audience member helped out by saying that it is in some ways like the
asynchronous initialization code (for faster boot), but "in the other
direction".
The RT_PREEMPT patchset "continues its slow march into the
kernel", with threaded
interrupt handlers being merged in 2.6.30 and preparatory work for the future sleeping spinlocks merge that
went into 2.6.33. There
are still some big kernel lock issues (BKL) to be resolved and CELF may fund some
work in that area.
Kernel size
The slide for kernel size and memory use had a picture of a "hybrid
Winnebago", which is the image Bird has of the kernel today. It
just keeps growing in size. To help embedded developers make better use of
limited memory, there is the smem tool that was funded by
CELF. He has used it in a few projects this year and it "has been
very helpful".
Various compression methods have been added to compress the kernel image in
different ways. LZMA can be up to 30% better than gzip, and LZO is not as
good at compression, but is much faster. There are tradeoffs dependent on
processor speed and I/O bandwidth that make it more difficult to pick the
right compression method, as Dirk Hohndel pointed out.
The ramzswap device (also known as compcache) allows in-memory
compressed swap. It is "really cool" but the maintainer only
was able to benchmark on desktop systems. It would be good if someone
could do some benchmarking on embedded systems, Bird said.
Tracing and security
Ftrace now has support for dynamic probes that came in 2.6.33, and the perf
tool can place and use those probes as well. There is tracing of
kernel variable access and modification available now. The perf "diff" mode can show the
performance differences between two runs, and also came in 2.6.33.
The TOMOYO merge in 2.6.30 was "a big deal" because it finally was able
to get path-based security into the kernel. NTT Data is now adding TOMOYO
rules to Android. Bird is in favor of a diversity of choices for security
as it gives people a chance to demonstrate which is the best solution for
various use cases. As part of that, CELF funded a study [PDF] for applying the
Smack security module in a television use case and found that the overhead
was higher than they expected.
CELF contract work
CELF has funded various projects over the last year including smem,
out-of-memory notifications in cgroups, SquashFS, the Smack analysis,
device trees for ARM, and the -ffunction-sections work to put each
kernel function in its own section to assist with dead code removal. Going
forward, CELF has an open project
proposal plan that will start funding new projects in the next few
weeks. It is also sponsoring Matt Mackall to be one of the two Linux kernel
embedded maintainers (David Woodhouse is the other).
A call to arms
Bird ended his talk with a list of things that embedded developers can do
to work better with the community. At the top of that list was "work
at top of tree". He realized that when he gives these talks, he is
generally talking to people who aren't using the kernels he talks about
because embedded folks tend to pick a kernel and stick with it. "It's
difficult" to work with the most recent kernel, but it's worth it.
"Version gap is the single biggest problem" in embedded Linux.
He suggested that embedded developers beat up on their board vendor to get
board support packages using the latest kernels and to do their testing on
boards that are already supported in the mainline.
The other suggestion he had was not to "wait for others to test new
features", and instead to do the testing themselves. He listed a
number of things that need testing in the mainline: LogFS, Ubifs mount
times, ramzswap, runtime power management, and so on. "Post the
results to the elinux.org
wiki" or come to the next conference (October 27-28 in
Cambridge, UK) and tell him about it.
Comments (2 posted)
By Jonathan Corbet
April 13, 2010
During the stabilization phase of the kernel development cycle, the -rc
releases typically happen about once every week.
2.6.34-rc4 is a clear
exception to that rule, coming nearly two weeks after the
preceding -rc3 release. The holdup in this case was a nasty regression
which occupied a number of kernel developers nearly full time for days.
The hunt for this bug is a classic story of what can happen when the code
gets too complex.
Sending email to linux-kernel can be an intimidating prospect for a number
of reasons, one of which being that one never knows when a massive thread -
involving hundreds of messages copied back to the original sender - might
result. Borislav Petkov's 2.6.34-rc3 bug
report was one such posting. In this case, though, the ensuing thread
was in no way inflammatory; it represents, instead, some of the most
intensive head-scratching which has been seen on the list for a while.
The bug, as reported by Borislav, was a null pointer dereference which
would happen reasonably reliably after hibernating (and restarting) the
system. It was quickly recognized as being the same as another bug
report filed the same day by Steinar H. Gunderson, though this one did
not involve hibernation. The common thread was null pointer dereferences
provoked by memory pressure. The offending patch was identified by Linus almost immediately; it's
worth taking a look at what that patch did.
Way back in 2004, LWN covered the
addition of the anon_vma code; this patch was controversial at the time
because the upcoming 2.6.7 kernel was still expected to be an old-style
"stable, no new features" release. This patch, a 40-part series which
fundamentally reworked the virtual memory subsystem, was not seen as stable
material, despite Linus's attempt to characterize it as an
"implementation detail." Still, over time, this code has proved solid and
has not been changed significantly since - until now.
The problem solved by anon_vma was that of locating all
vm_area_struct (VMA) structures which reference a given anonymous
(heap or stack memory) page. Anonymous pages are not normally shared
between processes, but every call to fork() will cause
all such pages to be shared between the parent and the new child; that
sharing will only be broken when one of the processes writes to the page,
causing a copy-on-write (COW) operation to take place. Many pages are
never written, so the kernel must be able to locate multiple VMAs which
reference a given anonymous page. Otherwise, it would not be able to
unmap the page, meaning that
the page could not be swapped out.
The reverse mapping solution originally used in 2.6 proved to be far too
expensive, necessitating a rewrite. This rewrite introduced the
anon_vma structure, which heads up a linked list of all VMAs which
might reference a given page. So a fork() also causes every VMA in
the child process which contains anonymous pages to be added to a the list
maintained in the parent's anon_vma structure. The
mapping pointer in struct page points to the
anon_vma structure, allowing the kernel to traverse the list and
find all of the relevant VMA structures.
This diagram, from the 2004 article, shows how this data structure looks:
This solution scaled far better than its predecessor, but eventually the
world caught up. So Rik van Riel set out to make things faster, writing this
patch, which was merged for 2.6.34. Rik describes the problem this
way:
In a workload with 1000 child processes and a VMA with 1000
anonymous pages per process that get COWed, this leads to a system
with a million anonymous pages in the same anon_vma, each of which
is mapped in just one of the 1000 processes. However, the current
rmap code needs to walk them all, leading to O(N) scanning
complexity for each page.
Essentially, by organizing all anonymous pages which originated in the same
parent under the same anon_vma structure, the kernel created a
monster data structure which it had to traverse every time it needed to
reverse-map a page. That led to the kernel scanning large numbers of VMAs
which could not possibly reference the page, all while holding locks. The
result, says Rik, was "catastrophic failure" when running the AIM
benchmark.
Rik's solution was to create an anon_vma structure for each
process and to link those together instead of the VMA structures. This
linking is done with a new structure called anon_vma_chain:
struct anon_vma_chain {
struct vm_area_struct *vma;
struct anon_vma *anon_vma;
struct list_head same_vma;
struct list_head same_anon_vma;
};
Each anon_vma_chain entry (AVC) maintains two lists: all
anon_vma structures relevant to a given vma (same_vma),
and all VMAs which fall within the area covered by a given
anon_vma structure (same_anon_vma). It gets
complicated, so some diagrams might help. Initially, we have a single
process with one anonymous VMA:
Here, "AV" is the anon_vma structure, and "AVC" is the
anon_vma_chain structure seen above. The AVC links to both the
anon_vma and VMA structures through direct pointers. The (blue)
linked list pointer is the same_anon_vma list, while the (red)
pointer is the same_vma list. So far, so simple.
Imagine now that this process forks, causing the VMA to be copied in the
child; initially we have a lonely new VMA like this:
The kernel needs to link this VMA to the parent's anon_vma
structure; that requires the addition of a new anon_vma_chain:
Note that the new AVC has been added to the blue list of all VMAs
referencing a given anon_vma structure. The new VMA also needs
its own anon_vma, though:
Now there's yet another anon_vma_chain structure linking in the
new anon_vma. The new red list has been expanded to contain all
of the AVCs which reference relevant anon_vma structures. As your
editor said, it gets complicated; the diagram for the 1000-child scenario
which motivated this patch will be left as an exercise for the reader.
When the
fork() happens, all of the anonymous pages in the area point back to the
parent's anon_vma structure. Whenever the child writes to a page
and causes a copy-on-write,
though, the new page will map back to the child's anon_vma
structure instead. Now, reverse-mapping that page can be done immediately,
with no need to scan through any other processes in the hierarchy. That
makes the lock contention go away, making benchmarkers happy.
The only problem is that embarrassing oops issue. Linus, Rik, Borislav,
and others chased after it, trying no end of changes. For a while, it
seemed that a bug causing excessive reuse of anon_vma structures
when VMAs were merged could be the problem, but fixing the bug did not fix
this oops. Sometimes, changing VMA boundaries with mprotect()
could cause the wrong anon_vma to be used, but fixing that one
didn't help either. The reordering of chains when they were copied was
also noted as a problem...but it wasn't the problem.
Linus was clearly beginning to wonder when
it might all end: "Three independent bugs found and fixed, and still
no joy?" He repeatedly considered just reverting the change
outright, but he was reluctant to do so; the solution seemed so
tantalizingly close.
Eventually he developed another hypothesis which seemed
plausible. An anonymous page shared between parent and child would
initially point to the parent's anon_vma:
But, if both processes
were to unmap the page (as could happen during system hibernation, for
example), then the child referenced it first, it could end up pointing to
the child's anon_vma instead:
If the parent mapped the page
later, then the child unmapped it (by exiting, perhaps), the parent would
be left with an
anonymous page pointing to the child's anon_vma - which no longer
exists:
Needless to say, that is a situation which is unlikely to lead to anything
good in the near future.
The fix is straightforward; when linking an
existing page to an anon_vma structure, the kernel needs to pick
the one which is highest in the process hierarchy; that guarantees that the
anon_vma will not go away prematurely.
Early testing suggests that the problem has
indeed been fixed. In the process, three other problems have been fixed
and Linus has come to understand a tricky bit of code which, if he has his
way, will soon gain some improved documentation. In other words, it would
appear to be an outcome worth waiting for.
Comments (34 posted)
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>