Brief items
The current development kernel remains 2.6.39-rc7. Linus has stated
his intent to release the final 2.6.39 kernel on May 18, but that
release has not happened as of this writing. Presumably he is simply
waiting for the LWN Weekly Edition to be published; 2.6.39 will almost
certainly be out by the time you read this.
Stable updates: there have been no stable kernel updates in the last
week.
I like the %p thingy - it's neat and is an overall improvement. If
it dies I shall stick another pin in my Ingo doll.
-- Andrew Morton (who also provided a picture of said doll)
HAMMER2 implements a root directory which is ABOVE the nominal
mount point for the filesystem. That is, the nominal mount point
is typically a file inside this directory instead of the directory
itself.
This feature can be replicated for any subdirectory, where the
parent holds multiple snapshots of said directory. There is no
global snapshot table per-say.
This makes it possible to trivially construct and maintain multiple
mirroring domains within any subdirectory structure. For example
you can construct a HAMMER2 filesystem which holds multiple roots
and then mount the desired one based on a boot menu item, and you
can work within these roots as if they were the root of the whole
filesystem (even though they are not).
-- Matthew Dillon launches another new filesystem
By Jonathan Corbet
May 17, 2011
There has been a determined effort over the last few kernel development
cycles to eliminate the leakage of kernel addresses into user space. A
determined attacker, it is thought, could use address information to figure
out where important data structures are in memory; that is an important
step toward corrupting those structures. So it arguably makes sense to
avoid exposing kernel addresses in /proc files and other places
where the kernel provides information to user space.
Early in the 2.6.39 development cycle, a patch was applied to censor kernel
addresses appearing in /proc/kallsyms and /proc/modules.
On an affected system, /proc/kallsyms looks like this:
...
0000000000000000 V callchain_recursion
0000000000000000 V rotation_list
0000000000000000 V perf_cgroup_events
0000000000000000 V nr_bp_flexible
0000000000000000 V nr_task_bp_pinned
0000000000000000 V nr_cpu_bp_pinned
...
Needless to say, zeroing out the address information makes this file rather
less useful than it had been previously. What drew attention to this
change, though, was a report that perf
produces bogus information in this situation. It seems that perf was not
detecting the hiding of kernel addresses, so it happily went forward with
all those zero values.
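As an illustration of the kind of check that was missing, here is a minimal sketch that parses kallsyms-style text and reports whether every address has been zeroed. It is not how perf does (or will do) the job; the function names and the "all addresses are zero" heuristic are assumptions made for this example.

```python
def parse_kallsyms(text):
    """Parse /proc/kallsyms-style text into (address, type, name) tuples."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank, elided ("...") or malformed lines
        try:
            address = int(fields[0], 16)
        except ValueError:
            continue  # first field is not a hexadecimal address
        symbols.append((address, fields[1], fields[2]))
    return symbols


def addresses_hidden(symbols):
    """Heuristic (an assumption): hidden if symbols exist and all are zero."""
    return bool(symbols) and all(address == 0 for address, _, _ in symbols)


# Sample taken from the listing above.
SAMPLE = """\
0000000000000000 V callchain_recursion
0000000000000000 V rotation_list
0000000000000000 V perf_cgroup_events
"""

if __name__ == "__main__":
    print(addresses_hidden(parse_kallsyms(SAMPLE)))
```

A tool that sees this result can refuse to resolve symbols, or warn the user, rather than proceeding with meaningless zero addresses.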
That is obviously a bug in perf; it will be fixed shortly. But a number of
developers complained about the practice of hiding kernel addresses by
default. That behavior makes the system less useful than it was before,
and will certainly cause other surprises. People who want whatever extra
security is provided by this behavior should have to ask for it explicitly,
it was said; David Miller pointed out that
other security technologies - like SELinux - are not turned on by default.
That argument won the day, so the final 2.6.39 release will not hide kernel
pointers by default. Anybody wanting pointer hiding should turn it on by
setting the kernel.kptr_restrict knob to 1.
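A program that depends on kernel addresses could check the knob before trusting what it reads. The sketch below assumes that the sysctl is exposed at /proc/sys/kernel/kptr_restrict and treats any nonzero value as "hiding enabled"; the second point is an assumption, since only the value 1 is discussed here. The helper names are made up for this example.

```python
from pathlib import Path

# Assumed location of the kernel.kptr_restrict sysctl.
KPTR_RESTRICT = Path("/proc/sys/kernel/kptr_restrict")


def pointer_hiding_enabled(raw_value):
    """Interpret the knob's text; nonzero is taken to mean 'hidden'."""
    return int(raw_value.strip()) != 0


def read_kptr_restrict(path=KPTR_RESTRICT):
    """Return the knob's raw text, or None if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_kptr_restrict()
    if raw is None:
        print("kptr_restrict is not available on this system")
    else:
        print("pointer hiding enabled:", pointer_hiding_enabled(raw))
```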
Kernel development news
By Jonathan Corbet
May 17, 2011
The control group mechanism allows an administrator to group processes
together and apply any of a number of resource usage policies to them. The
feature has existed for some time, but only recently have we seen
significant use of it. Control groups are now the basis for per-group CPU
scheduling (including the automatic per-session group scheduling that was
merged for 2.6.38), process management in systemd, and more. This feature
is clearly useful, but it also has a bad reputation among many kernel
developers who often are heard to mutter that they would like to yank
control groups out of the kernel altogether. In the real world, removing
control groups is an increasingly difficult thing to do, so it makes sense
to consider the alternative: fixing them.
One of the complaints about control groups is that they have been "bolted
on" to existing kernel mechanisms rather than properly integrated into
those mechanisms. Given the relatively late arrival of control groups,
that is, perhaps, not a surprising outcome. When attaching a significant
new feature to long-established core kernel code, it is natural to try to
keep to the side and minimize the intrusion on the existing code. But
bolting code onto the side is not always the way toward an optimal solution
which can be maintained over the long term. Some recent work with the memory
controller highlights this problem - and points toward an improvement
of the situation.
The system memory map consists of one struct page for each
physical page in the system; it can be thought of as an extensive array of
structures matching the array of pages:
The kernel maintains a global least-recently-used (LRU) list to track
active pages. Newly-activated pages are placed at the end of the list;
when it is time to reclaim pages, the pages at the head of the list will be
examined first. The structure looks something like this:
Much of the tricky code in the memory management subsystem has to do with
how pages are placed in - and moved within - this list.
Of course, the situation is a little more complicated than that. The
kernel actually maintains two LRU lists; the second one holds "inactive"
pages which have been unmapped, but which still exist in the system:
The kernel will move pages from the active to the inactive list if it
thinks they may not be needed in the near future.
Pages in the inactive LRU can be moved quickly back to the active list if
some process tries to access them. The inactive list can be thought of as
a sort of probationary area for pages that the system is considering
reclaiming soon.
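The interplay between the two lists can be modeled in a few lines. What follows is a toy sketch, not the kernel's implementation: the class and method names are invented for illustration, pages are represented by simple labels, and none of the kernel's heuristics for choosing which pages to demote are present.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy model of an active/inactive LRU pair; oldest entries come first."""

    def __init__(self):
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def activate(self, page):
        """Put a page at the tail of the active list (new or re-accessed)."""
        self.inactive.pop(page, None)
        self.active.pop(page, None)
        self.active[page] = None

    def deactivate_oldest(self, count=1):
        """Move up to count pages from the head of active to inactive."""
        moved = []
        for _ in range(min(count, len(self.active))):
            page = self.active.popitem(last=False)[0]
            self.inactive[page] = None
            moved.append(page)
        return moved

    def reclaim(self, count=1):
        """Remove up to count pages from the head of the inactive list."""
        reclaimed = []
        for _ in range(min(count, len(self.inactive))):
            reclaimed.append(self.inactive.popitem(last=False)[0])
        return reclaimed


lru = TwoListLRU()
for label in ("a", "b", "c"):
    lru.activate(label)
demoted = lru.deactivate_oldest(2)   # "a" and "b" go on probation
lru.activate("a")                    # "a" is touched again and rescued
victims = lru.reclaim(1)             # "b" is the first to be reclaimed
```

In this run, page "a" survives because it was accessed while sitting on the inactive list, while "b" is reclaimed.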
Of course, the situation is still more complicated than that. Current kernels
actually maintain five LRU lists. There are separate active and
inactive lists for anonymous pages - reclaim policy for those pages is
different, and, if the system is running without swap, they may not be
reclaimable at all. There is also a list for pages which are known not to
be reclaimable - pages which have been locked into memory, for example.
Oh, and it's only fair to say that one set of those lists exists for each memory
zone. Despite the proliferation of lists, this set, as a whole, is called
the "global LRU."
Creating a diagram with all these lists would overtax your editor's rather
inadequate drawing skills, though, so envisioning that structure is left as
an exercise for the reader.
The memory controller adds another level of complexity as the result of
its need to be able to reclaim pages belonging to specific control groups.
The controller needs to track more information for each page, including a
simple pointer associating each page with the memory control group it is
charged to. Adding that information to struct page was not really
an option; that structure is already packed tightly and there is little
interest in making it larger. So the memory controller adds a new
page_cgroup structure for each page; it has, in essence, created a
new, shadow memory map:
When memory control groups are active, there is another complete set of LRU
lists maintained for each group. The list_head structures needed
to maintain these lists are kept in the page_cgroup structure.
What results is a messy structure along these lines:
(Once again, the situation is rather more complicated than has been shown
here; among other things, there is a series of intervening structures
between struct mem_cgroup and the LRU lists.)
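The idea of a shadow memory map can be sketched as two parallel arrays indexed by page frame number. This is a simplified model only: the Python classes stand in for struct page and page_cgroup, the single field shown is an assumption chosen for illustration, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Stand-in for struct page; only the frame number is modeled."""
    pfn: int


@dataclass
class PageCgroup:
    """Stand-in for page_cgroup: records which group a page is charged to."""
    mem_cgroup: Optional[str] = None


NR_PAGES = 8  # tiny illustrative system
mem_map = [Page(pfn) for pfn in range(NR_PAGES)]
shadow_map = [PageCgroup() for _ in range(NR_PAGES)]  # one entry per page


def charge(pfn, group):
    """Charge the page with the given frame number to a control group."""
    shadow_map[pfn].mem_cgroup = group


def lookup_page_cgroup(page):
    """Find the shadow entry for a page without enlarging Page itself."""
    return shadow_map[page.pfn]


charge(3, "web")
```

The point of the arrangement is visible in lookup_page_cgroup(): the extra per-page data lives beside the memory map rather than inside it.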
There are a number of disadvantages to this sort of arrangement. Global
reclaim uses the global LRU as always, so it operates in complete ignorance
of control groups. It will reclaim pages regardless of whether those pages
belong to groups which are over their limits or not. Per-control-group
reclaim, instead, can only work with one group at a time; as a result, it
tends to hammer certain groups while leaving others untouched. The
multiple LRU lists are not just complex, they are also expensive. A
list_head structure is 16 bytes on a 64-bit system. If that
system has 4GB of memory and uses 4KB pages, it has roughly 1,000,000
pages (1,048,576, to be exact), so about 16 million bytes
are dedicated just to the infrastructure for the per-group LRU lists.
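The arithmetic is easy to reproduce. This short sketch assumes 4KB pages (the page size is not stated above) and one 16-byte list_head per page; the function name is invented for the example.

```python
PAGE_SIZE = 4096        # assumed page size, in bytes
LIST_HEAD_BYTES = 16    # two 8-byte pointers on a 64-bit system


def lru_link_overhead(memory_bytes, page_size=PAGE_SIZE,
                      list_head_bytes=LIST_HEAD_BYTES):
    """Return (page count, bytes spent on one list_head per page)."""
    pages = memory_bytes // page_size
    return pages, pages * list_head_bytes


pages, overhead = lru_link_overhead(4 * 2**30)  # a 4GB system
print(pages, overhead)
```

The overhead scales linearly with memory size, so a system with ten times the memory pays ten times the cost.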
This is the kind of situation that kernel developers are referring to
when they say that control groups have been "bolted onto" the rest of the
kernel. This structure was an effective way to learn about the memory
controller problem space and demonstrate a solution, but there is clearly
room for improvement here.
The memcg naturalization patches from
Johannes Weiner represent an attempt to create that improvement by better
integrating the memory controller with the rest of the virtual memory
subsystem. At the core of this work is the elimination of the duplicated
LRU lists. In particular, with this patch set, the global LRU no longer
exists - all pages exist on exactly one per-group LRU list. Pages which
have not been charged to a specific control group go onto the LRU list for
the "root" group at the top of the hierarchy. In essence, per-group
reclaim takes over the older global reclaim code; even a system with
control groups disabled is treated like a system with exactly one control
group containing all running processes.
Algorithms for memory reclaim necessarily change in this environment. The
core algorithm now performs a depth-first traversal through the control
group hierarchy, trying to reclaim some pages from each. There is no
global aging of pages; each group has its oldest pages considered for
reclaim regardless of what's happening in the other groups. Each group's
hard and soft limits are considered, of course, when setting reclaim
targets. The end result is that global reclaim naturally spreads the pain
across all control groups, implementing each group's policy in the
process. The implementation of control group soft limits has been
integrated with this mechanism, so now soft limit enforcement is spread
more fairly across all control groups in the system.
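The shape of that traversal can be shown with a toy model. The sketch below is not taken from the patch set: the Group class, the fixed per-group target, and the use of a plain list as each group's LRU are all simplifying assumptions, and the hard and soft limits that would set the real targets are left out.

```python
class Group:
    """Toy control group holding its own LRU list, oldest page first."""

    def __init__(self, name, pages, children=()):
        self.name = name
        self.lru = list(pages)
        self.children = list(children)


def reclaim_hierarchy(root, per_group_target):
    """Walk the hierarchy depth-first, taking the oldest pages of each group."""
    reclaimed = []
    stack = [root]
    while stack:
        group = stack.pop()
        take = min(per_group_target, len(group.lru))
        reclaimed.extend((group.name, page) for page in group.lru[:take])
        del group.lru[:take]
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(group.children))
    return reclaimed


a1 = Group("A1", ["x1"])
a = Group("A", ["a1", "a2", "a3"], [a1])
b = Group("B", ["b1"])
root = Group("root", ["r1", "r2"], [a, b])

result = reclaim_hierarchy(root, 2)
```

Every group, including the root, gives up its own oldest pages; no group's pages are compared against those of any other group.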
Johannes's patch improves the situation while shrinking the code by over
400 lines; it also gets rid of the memory cost of the duplicated LRU lists.
On the down side, it makes some fundamental changes to the kernel's memory
reclaim algorithms and heuristics; such changes can cause surprising
regressions on specific workloads and, thus, tend to need a lot of scrutiny
and testing. Absent any such surprises, this early-stage patch set looks
like a promising step toward the goal of turning control groups into a
proper kernel feature.
May 18, 2011
This article was contributed by Paul McKenney
Some of you might have heard about some discomfort with the state of
the ARM architecture in the kernel recently.
Given that ARM Linux consolidation was one of the issues that Linaro
was specifically set up to address, it is only fair to ask
“What is Linaro doing about this?”
So it should not come as a surprise that this topic featured
prominently at the recent Linaro Developers Summit in Budapest, Hungary.
Duplicate code and out-of-tree patches
make Linux on ARM more difficult to use and develop for.
Therefore, Linaro is working to consolidate code and to push code
upstream.
This should make the upstream Linux kernel
more capable of handling ARM boards and system-on-chips (SoCs).
However, ARM Linux kernel consolidation
is an issue not just for Linaro, but rather
across the entire ARM Linux kernel community, as well as
the ARM SoC, board, and system vendors.
Therefore, although I expect that Linaro will play a key role,
the ultimate solution spans the entire ARM
community.
It is also important to note that this effort is a proposal
for an experiment rather than a set of hard-and-fast marching orders.
Code organization
If we are to make any progress at all, we must start
somewhere.
An excellent place to start is by organizing the ARM Linux kernel
code by function rather than by SoC/board implementation.
Grouping together code with similar purposes will make it easier
to notice common patterns and, indeed, common code.
For example, currently many ARM SoCs use similar “IP blocks”
(such as I2C controllers) but each SoC provides a completely different
I2C driver that lives in the corresponding arch/arm/mach-
directory.
We expect that drivers
for identical hardware “IP blocks”
across different ARM boards and SoCs
will be consolidated into a single driver that works with any system using
the corresponding IP block.
In some cases, differences in the way that a given IP block is connected
to the SoC or board in question may introduce complications, but such
complications can almost always be addressed.
This raises the question of where similar code should be moved to.
The short answer that was agreed to by all involved is
“Not in the arch/arm directory!”
Drivers should of course move to the appropriate
subdirectory of the top-level drivers tree.
That said, ARM SoCs have a wide variety of devices ranging from touchscreens
to GPS receivers to accelerometers,
and new types of devices can be expected to appear.
So in some cases it might be necessary not merely to move the driver to
a new place, but also to create a new place in the drivers
tree.
But what about non-driver code?
Where should it live?
It is helpful to look at several examples: (1) the struct clk
code that Jeremy Kerr, Russell King, Thomas Gleixner, and many others
have been working on,
(2) the device-tree code that Grant Likely has been leading up, and
(3) the generic interrupt chip implementation that Thomas Gleixner has
been working on.
The struct clk code is motivated by the fact that many
SoCs and boards have elaborate clock trees.
These trees are needed, among other things, to allow the tradeoff between
performance and energy efficiency to be set as needed for individual
devices on that SoC or board.
The struct clk code allows these trees to be represented
with a common format while providing plugins to accommodate behavior
specific to a given SoC or board.
The generic interrupt chip implementation
has a similar role, but with respect to
interrupt distribution rather than clock trees.
Device trees are intended
to allow the hardware configuration of a board to be represented
via data rather than code, which should ease the task of creating a
single Linux kernel binary that boots on a variety of ARM boards.
The device-tree infrastructure patches have recently been
accepted by Russell King, which
should initiate the transition of specific board code to device tree
descriptions.
The struct clk code is already used by
both the ARM and SH CPU architectures,
so it is not ARM-specific, but rather core Linux kernel code.
Similarly, the device-tree code is not ARM-specific; it
is also used by the PowerPC, Microblaze, and SPARC architectures, and even by
x86.
Device tree therefore is also Linux core kernel code.
The virtual-interrupt code goes even further, being common across all
CPU architectures.
The lesson here is that ARM kernel code consolidation need not necessarily
be limited to ARM.
In fact, the more architectures that a given piece of code supports,
the more developers can be expected to contribute both code and testing
to it, and the more robust and maintainable that code will be.
There will of course need to be at least some ARM-specific code,
but the end goal is for that code to be limited to ARM core architecture code
and ARM SoC core architecture code.
Furthermore, the ARM SoC core architecture code should consist primarily
of small plugins for core-Linux-kernel frameworks, which should in turn
greatly ease the development and maintenance of new ARM boards and SoCs.
It is all very easy to write about doing this, but quite another to
actually accomplish it.
After all, although there are a good number of extremely talented and
energetic ARM developers and maintainers, many of the newer ARM developers
are also new to the Linux kernel, and cannot
be expected to know where new code should be placed.
Such people might be tempted to continue placing most of their code in
their SoC and board subdirectories, which would just perpetuate the
current ARM Linux kernel difficulties.
Part of the solution will be additional documentation, especially
on writing ARM drivers and board ports.
Deepak Saxena, the new Linaro Kernel Working Group lead, will be
making this happen.
Unfortunately, documentation is only useful to the extent that anyone
actually reads it.
Fortunately, just as every problem in computer science seems to be solvable
by adding an additional level of indirection, every maintainership
problem seems to be solvable by adding an additional git tree
and maintainers.
These maintainers would help generate common code and of course point
developers at documentation as it becomes available.
Git trees
One approach would be to use Nicolas Pitre's existing Linaro kernel
git tree.
However, Nicolas's existing git tree is an integration tree
that allows people
to easily pull the latest and greatest ARM code against the most
recent mainline kernel version.
In contrast, a maintainership tree contains patches that are to be
upstreamed, normally based on a more-recent mainline release candidate.
If we tried to use a single git tree for both integration and for
maintainership, we would either unnecessarily expose ARM users to
unrelated core-kernel bugs, or we would fail to track mainline
closely enough for maintainership, which would force a full rebase
and testing cycle to happen in a very short time at the beginning of
each merge window.
Of course, in theory we could have both maintainership and
integration branches within the same git tree, but separating these
two very different functions into separate git trees is most likely
to work well, especially in the beginning.
This new git tree (which was announced
on May 18) will have at least one branch per participating ARM
subarchitecture, and these branches will not be normally subject to
rebasing, thus making it easy to develop against this new tree.
Following the usual practice, maintainers of participating ARM
sub-architectures will send pull requests to a group of maintainers
for this new tree. Also following the usual practice, a merge of all
the branches will be sent to Stephen Rothwell's -next tree, but the
branches will be individually pushed to Linus Torvalds, perhaps via
Russell King's existing ARM tree.
The pushing of individual branches to Linus might seem surprising,
but Linus really does want to see the conflicts that arise.
Such conflicts presumably help Linus identify areas in need of his
attention.
Of course, this new git tree will not be limited to Linaro, but neither
is it mandatory outside of Linaro.
That said, I am very happy to say that some maintainers outside of Linaro
have expressed interest in participating in this effort.
The Budapest meeting put forward a list of members of the
maintainership group for this new git
tree, namely Arnd Bergmann, Nicolas Pitre, and Marc Zyngier, with help
from Thomas Gleixner.
Russell King will of course also have write access to this tree.
The tree will be set up in time to handle the 2.6.41 merge window.
The plan is to start small and grow by evolution rather than by
any attempts at intelligent design.
As noted at the beginning of this article, this effort is an
experiment rather than a set of hard-and-fast marching orders.
Although this proposed experiment cannot be expected to solve each
and every ARM Linux problem, it will hopefully provide a good start.
Every little bit helps, and every cleanup frees a little time to
start on the next cleanup.
There is reason to hope that this effort will help to reduce the
“endless amounts of new pointless platform code”
that irritated Linus Torvalds last month.
I owe thanks to the many people who helped take notes at the recent
Linaro Developers Summit in Budapest, and to all the people involved
in the discussions, both in the room and via IRC.
Special thanks go to Jake Edge, David Rusling, Nicolas Pitre,
Deepak Saxena, and Grant Likely
for their review of an early draft of this article.
However, all remaining errors and omissions are the sole property
of the author.
By Jonathan Corbet
May 18, 2011
Your editor first heard the "platform problem" described by Thomas
Gleixner. In short, the platform problem comes about when developers view
the platform they are developing for as fixed and immutable. These developers
feel that the component they are working on specifically (a device driver,
say) is the only part that they have any control over. If the kernel
somehow makes their job harder, the only alternatives are to avoid or work
around it. It is easy to see how such an attitude may come about, but the
costs can be high.
Here is a close-to-home example. Your editor has recently had cause to
tear into the cafe_ccic Video4Linux2 driver in order to make it work in
settings beyond its original target (which was the OLPC XO 1 laptop).
This driver has a fair amount of code for the management of buffers
containing image frames: queuing them for data, delivering them to the
user, implementing
mmap(), implementing the various buffer-oriented V4L2 calls, etc.
Looking at this code, it is quite clear that it duplicates the
functionality provided by the videobuf
layer. It is hard to imagine what inspired the idiotic cafe_ccic developer
to reinvent that particular wheel.
Or, at least, it would be hard to imagine except for the inconvenient fact
that said idiotic developer is, yes, your editor. The reasoning at the
time was simple: videobuf assumed that the underlying device was able to
perform scatter/gather DMA operations; the Cafe device was nowhere near so
enlightened. The obvious right thing to do was to extend videobuf to
handle devices which were limited to contiguous DMA operations; this job
was eventually done by Magnus Damm a couple of years later. But, for the
purposes of getting the cafe_ccic driver going, it simply seemed quicker
and easier to implement the needed functionality inside the driver itself.
That decision had a cost beyond the bloating of the driver and the kernel
as a whole. Who knows how many other drivers might have benefited from the
missing capability in the years before it was finally implemented? An
opportunity to better understand (and improve) an important support layer
was passed up. As videobuf has improved over the years, the cafe_ccic
driver has been stuck with its own, internal implementation which has seen no
improvements at all. We ended up with a dead-end, one-off solution instead
of a feature that would have been more widely useful.
Clearly, with hindsight, the decision not to improve videobuf was a
mistake. In truth, it wasn't even a proper decision; that option was never
really considered as a way to solve the problem. Videobuf could not solve
the problem at hand, so it was simply eliminated from consideration.
The sad fact is that this
kind of thinking is rampant in the kernel community - and well beyond. The
platform for which a piece of code is being written appears fixed and not
amenable to change.
It is not all that hard to see how this kind of mindset can come about.
When one develops for a proprietary operating system, the platform is
indeed fixed. Many developers have gone through periods of their career
where the only alternative was to work around whatever obnoxiousness the
target platform might present. It doesn't help that certain layers of the
free software stack also seem frustratingly
unfixable to those who have to deal with them. Much of the time, there
appears to be no alternative to coping with whatever has been provided.
But the truth of the matter is that we have, over the course of many
years, managed to create a free operating system for ourselves. That
freedom brings many advantages, including the ability to reach across
arbitrary module boundaries and fix problems encountered in other parts of
the system. We don't have to put up with bugs or inadequate features in
the code we use; we can make it work properly instead. That is a valuable
freedom that we do not exploit to its fullest.
This is a hard lesson to teach to developers, though. A driver developer
with limited time does not want to be told that a bunch of duplicated or
workaround code should be deleted and common code improved instead. Indeed,
at a kernel summit a few years ago, it was generally agreed that, while
such fixes could be requested of developers, to require them as a condition
for the merging of a patch was not reasonable. While we can encourage
developers to think outside of their specific project, we cannot normally
require them to do so.
Beyond that, working on common code can be challenging and intimidating.
It may force a developer to move out of his or her comfort zone. Changes
to common code tend to attract more attention and are often held to higher
standards. There is always the potential of breaking other users of that
code. There may simply be the lack of time for - or interest in -
developing the wider view of the system which is needed for successful
development of common code.
There are no simple solutions to the platform problem. A lot of it comes
down to oversight and mentoring; see, for example, the ongoing effort to
improve the ARM tree, which has a severe case of this problem. Developers
who have supported the idea of bringing more projects together in the same
repository also have the platform problem in mind; their goal is to make
the lines between projects softer and easier to cross. But, given how
often this problem shows up just within the kernel, it's clear that
separate repositories are not really the problem. What's really needed is
for developers to understand at a deep level that platforms are amenable to
change and that one does not have to live with second-rate support.
Patches and updates
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
- Sage Weil: d_prune. (May 12, 2011)
Memory management
Networking
Architecture-specific
Security-related
- Mimi Zohar: EVM. (May 16, 2011)
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet