The current development kernel is 2.6.30-rc8, released on June 2. It is
probably the last prepatch before the final 2.6.30 release. "A lot
of small stuff, fixing a few regressions (and at least one bugzilla entry
going back to 2.6.24). The small stuff does matter. Please test."
Full details can be found in the changelog.
There have been no stable releases over the last week; the last
stable update was 2.6.29.4, released on May 18.
Kernel development news
fyi, the above discussion transitions akpm into the "confused" state.
I'll keep the patch on hold until akpm transitions back out of that state.
-- Andrew "akpm" Morton
Because, when you think about it, there's really no merit in
having consistently wrong code. A mix of right and wrong is
better than 100% wrong.
-- Andrew Morton
The Dom0 push of Xen just seems too much like Linux being Xen's sex
slave, when it should be the other way around.
-- Steven Rostedt
Retrying core dump writes: Paul Smith posted a patch that would retry short or interrupted
writes while dumping core, thus preventing the creation of an incomplete
core dump when a signal arrives. Alan Cox NAK-ed the patch noting: "The existing behaviour is an absolute godsend when you've something like
a core dump stuck on an NFS mount or something trying to core dump to
very slow media." But the idea did lead to some interesting
discussion of which signals should cause a core dump to be
interrupted—thus leaving a short core file—and which should be ignored:
There is an inherent difference between some interactive program
that is dumping core which a user might wish to interrupt with
SIGINT versus a non-interactive process which the user or
developer might wish to finish its core dump.
Smith describes one scenario: "a worker process might appear unresponsive due to a core being dumped
and the parent would decide to shoot it with SIGINT based on various
timeouts etc." No decision was made, but Roland McGrath analyzed four signal categories and noted that
at least two of the categories needed to be addressed as they are
mishandled by the current code.
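The mechanics of the retry itself are simple; here is a rough user-space
sketch of the idea, using a hypothetical dump_write_full() helper rather
than Smith's actual kernel patch:

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Keep writing on short or signal-interrupted writes; give up only
     * on a real I/O error.  A real implementation would also check
     * whether the interrupting signal should abort the dump. */
    static int dump_write_full(int fd, const char *buf, size_t count)
    {
        while (count > 0) {
            ssize_t n = write(fd, buf, count);

            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted by a signal: retry */
                return -1;         /* real error: give up */
            }
            buf += n;              /* short write: resume where we left off */
            count -= n;
        }
        return 0;
    }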
Device tree. The Open Firmware "device tree" is a description of a
system's hardware configuration in a standardized data structure. Some
platforms have used device trees to separate the description of the
hardware from the kernel running on that hardware; that, in turn, allows
one kernel to support a wider variety of systems. Janboe Ye recently proposed adding device tree support to the ARM
architecture, which arguably supports the widest variety of hardware of
all. That has, in turn, led to a long discussion of how much device tree
really helps, and how feasible it is to create a single kernel for all
systems of a given architecture.
Developers of architectures using device tree seem to be happy with the
results; see this
2008 OLS paper [PDF] for a description of how things went with the
PowerPC architecture. Maintainers of other architectures are less
convinced, though. ARM maintainer Russell King worries that device tree could turn out to be
an expensive dead end; he would like to see a subset of ARM architectures
converted first to find out whether it is likely to work well or not. An
incremental approach probably makes sense in general, so that's how things
are likely to go.
The "host protected area" is an IDE concept which allows a
controller to hide a portion of a drive from the operating
system's view. When HPA was introduced years ago, its primary use was to
make large drives (by the standards of the day) appear small so that
certain legacy operating systems would not be confused. Linux, naturally,
never had any such problem, so the Linux IDE layer would traditionally
disable the HPA during the probing process. That was the right thing to do
at the time; it allowed Linux systems to make use of the entire drive.
It has been a while since operating systems required protection from the
shock of seeing an overly-large drive. But the HPA remains for
different reasons. Vendors will use the HPA to stash RAID information, for
example. Windows systems often come with a full "reinstall this system
from the beginning" recovery image - apparently a useful feature on that
platform. Rootkits sometimes hide information there. And so on. In all
cases but the last, it is probably a mistake for the operating system to
overwrite the HPA on contemporary systems. So disabling the HPA by
default is no longer the right thing to do.
The libata driver subsystem has respected the HPA since the beginning, but
the IDE code retains its old default. That could change, though, with a patch set posted by IDE
maintainer Bartlomiej Zolnierkiewicz. These patches will cause the IDE
layer to preserve the HPA by default - unless the drive has partitions
which cover the HPA already. That test should be enough to ensure that
older systems continue to function while avoiding trashing the HPA on newer
drives. For systems not properly covered by this change, the
nohpa module parameter can be used to control HPA behavior.
reflink(). There's another reflink()
proposal out there. This one simplifies the preserve argument
slightly, replacing the set of flags with an all-or-none option for now.
So reflink() can be used in the full snapshot mode (with suitable
privilege) or in the reflink-as-copy mode, but with no options in between.
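How the simplified interface might be used, as a hypothetical sketch; the
prototype and flag names below are assumptions based on the description
above, not the actual proposal:

    #include <stdio.h>

    /* Assumed interface, for illustration only. */
    int reflink(const char *oldpath, const char *newpath, int preserve);

    #define REFLINK_PRESERVE_NONE 0  /* reflink-as-copy mode */
    #define REFLINK_PRESERVE_ALL  1  /* full snapshot mode (privileged) */

    int main(void)
    {
        /* Create a snapshot which shares the original file's blocks. */
        if (reflink("data.img", "data-snap.img", REFLINK_PRESERVE_ALL) != 0)
            perror("reflink");
        return 0;
    }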
Control over process IDs. The proposed checkpoint/restart feature
has a number of challenges to overcome. One of those is that processes can
become very confused if their process ID changes suddenly. So restarting a
checkpointed process requires that the process's old ID be restored as
well. The use of PID namespaces can help to ensure that the requisite IDs
are available, but there's no way in Linux to request that a process be
started with a specific ID.
Sukadev Bhattiprolu has a
proposal for a new system call to address this problem:
clone_with_pids(). It would behave like an ordinary
clone(), but with an additional argument: an array of
process IDs. The array contains one desired process ID for each namespace
in the current hierarchy, with the first being the global namespace.
Deeply-nested processes can, thus, be created with a specific ID in each
namespace where they will appear.
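As a concrete (and purely hypothetical) sketch, restarting a process
nested two namespaces deep might look like this; the prototype below is an
assumption based on the description, not the actual patch:

    #include <signal.h>
    #include <sys/types.h>

    /* Assumed prototype, for illustration only. */
    pid_t clone_with_pids(int (*fn)(void *), void *child_stack, int flags,
                          void *arg, pid_t *pids, unsigned int nr_pids);

    static int restarted_main(void *arg)
    {
        return 0;   /* the restored process would resume here */
    }

    static pid_t restart(void *stack_top)
    {
        /* One desired ID per namespace level, global namespace first. */
        pid_t desired[] = { 12345, 7, 2 };

        return clone_with_pids(restarted_main, stack_top, SIGCHLD, NULL,
                               desired,
                               sizeof(desired) / sizeof(desired[0]));
    }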
This patch has been "gently tested" and not posted outside of the
containers list, so it has seen relatively little review thus far. Expect
some changes if this code starts to get closer to the mainline.
The recently-discussed kernel
memory sanitization patch
was criticized on a number of points, one of
which was its use of a dedicated page flag. Andi Kleen's HWPOISON patches
(which support upcoming Intel CPU features for dealing with memory errors) have run into
trouble on similar grounds. The desperate shortage of page flags has been
an article of faith among kernel developers for years. But, interestingly,
not everybody agrees that a problem exists, and almost nobody can answer
the simple question of how many flags are available in the first place. So
a look at the Linux page flags issue seems in order.
"Page flags" are simple bit flags describing the state of a page of
physical memory. They are defined in <linux/page-flags.h>.
Flags exist to mark "reserved" pages (kernel memory, I/O memory, or simply
nonexistent), locked pages, those under writeback I/O, those which are part
of a compound page, pages managed by the slab allocator, and more.
Depending on the target architecture and kernel configuration options
selected, there can be as many as 24 individual flags defined.
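A few representative flags and the accessor pattern, in simplified form;
this sketch is not the kernel's actual code, which generates atomic
accessors with preprocessor macros:

    enum pageflags {
        PG_locked,      /* page is locked, e.g. during I/O */
        PG_dirty,       /* page has been modified */
        PG_slab,        /* page belongs to the slab allocator */
        PG_writeback,   /* page is under writeback I/O */
        PG_reserved,    /* kernel, I/O, or nonexistent memory */
        /* ... up to two dozen flags, depending on configuration */
    };

    struct page {
        unsigned long flags;
        /* ... */
    };

    static inline int PageDirty(const struct page *page)
    {
        return (page->flags >> PG_dirty) & 1;
    }

    static inline void SetPageDirty(struct page *page)
    {
        page->flags |= 1UL << PG_dirty;  /* the kernel uses atomic bitops */
    }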
These flags live in the flags field of struct page. This
field is declared to be an unsigned long, so one might think
that figuring out how much space is left for new flags would be a
straightforward task. To a casual observer, it would look like, on a
32-bit system, 24 flags have been used, leaving eight available;
schematically, something like this:
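     31          24 23                                  0
    +-------------+------------------------------------+
    | eight free  |           24 page flags            |
    +-------------+------------------------------------+

(An illustrative layout; the exact flag count depends on the architecture
and configuration.)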
In other words, the situation is starting
to get tight, but it is not a crisis quite yet.
But little is straightforward when it comes to struct page.
One of these structures exists for every physical page in the system; on a
4GB system, there will be one million page structures. Given that each
byte added to struct page is amplified a million times,
it is not surprising that there is a strong motivation to avoid growing this
structure at any cost. So struct page contains no less than
three unions and is surrounded by complicated rules describing which fields
are valid at which times. Changes to how this structure is accessed must
be made with great care.
Unions are not the only technique used to shoehorn as much information as
possible into this small structure. Non-uniform memory access (NUMA)
systems need to track information on which node each page belongs to, and
which zone within the node as well. Rather than add fields to
struct page, the NUMA hackers grabbed the free bits at the
top of the flags field, yielding something like this:
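     31        26 25     24 23                          0
    +------------+---------+----------------------------+
    |  node (6)  | zone (2)|       24 page flags        |
    +------------+---------+----------------------------+

(Illustrative field widths; the kernel computes the actual sizes from
configuration options.)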
So, on a 32-bit system with 24 page flags
defined (a pessimistic scenario), there are eight bits available for the
node and zone information, practically limiting 32-bit NUMA systems to
64 nodes, which is almost certainly adequate. But the addition of
more page flags would come at the cost of supporting fewer NUMA nodes, and
that would be unwelcome.
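A minimal sketch of how such packed fields are read back, using the
illustrative widths above; the real kernel derives its shifts and masks
from configuration options:

    #define ZONE_BITS 2
    #define NODE_BITS 6
    #define FLAG_BITS (32 - NODE_BITS - ZONE_BITS)   /* 24 bits of flags */

    struct page {
        unsigned long flags;   /* node | zone | flags, packed together */
    };

    static inline unsigned int page_to_nid(const struct page *page)
    {
        return (page->flags >> (FLAG_BITS + ZONE_BITS))
                & ((1U << NODE_BITS) - 1);
    }

    static inline unsigned int page_zonenum(const struct page *page)
    {
        return (page->flags >> FLAG_BITS) & ((1U << ZONE_BITS) - 1);
    }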
Things get worse on systems with complicated physical memory layouts. On
such systems, memory is not organized into a single, contiguous range of
physical addresses; instead, it is spread out with holes in the middle.
Memory management on these "sparse memory" systems requires that each page
have a "section" number associated
with it. That section number is stored - you guessed it - in the spare
bits at the top of the flags field. If space gets too tight, the
kernel will move the node number into a separate array, slowing things down
in the process. Either way, it seems clear that there is not a whole lot
of spare room in the flags field on these systems.
So the real answer to "how many page flags are free?" is, for all practical
purposes, "zero," at least on 32-bit NUMA systems. Making room for more
would require expanding struct page, which is a heavy cost to
pay. Developers should, thus, not be surprised when proposals to use new
page flags run into stiff opposition. It's only one bit, but that bit is
in the middle of some of the most sought-after real estate in the entire
kernel.
In the case of Andi's HWPOISON patch, this opposition has come in the form
of a number of alternative suggestions. One was to simply use the "reserved" bit, but
that could lead to difficulties in parts of the code where that usage is
not expected. Then it was suggested that
the combination of the "reserved" and "writeback" flags could indicate a
poisoned page, but Andi claims that this
approach cannot work. Andrew Morton has suggested that HWPOISON could be made into a
64-bit-only feature; Andi allows that this might be possible, but he
clearly doesn't like the idea.
Instead, Andi takes the position that the page flag
shortage does not really exist. It's not a problem at all on 64-bit
systems, where unsigned long is twice as wide. The number of
32-bit systems with a large number of NUMA nodes is small and shrinking;
it's not something that the developers need be concerned about. And, says
Andi, if things get really bad, the sparse memory section number can be
moved into a separate array like the NUMA node number. Given this view of
the problem, holding up a useful new feature over concerns about a single
page flag bit seems misplaced.
Nobody has challenged Andi's view that the problem is not as severe as most
people think, though Andrew Morton has hinted that Andi should go ahead and prove his
ideas about moving the section number out of the page structure.
That might not be a bad idea. Even if page flags are a little more
abundant than most developers think, it still is not hard to foresee a time
when they are exhausted, at least on 32-bit systems. Proposals involving
new page flags are not particularly rare; unless we want to restrict
features needing page flags to 64-bit systems, we'll need to make some more
flags available before too long.
Your editor is widely known for his invariably correct and infallible
predictions. So, certainly, he would never have said something like this:
Mistakes may have been made in Xen's history, but it is a project
which remains alive, and which has clear reasons to exist. Your
editor predicts that the Dom0 code will find little opposition at
the opening of the 2.6.30 merge window.
OK, anybody needing any further evidence of your editor's ability to
foresee the future need only look at his investment portfolio...or, shall
we say, the smoldering remains thereof. Needless to say, Xen Dom0 support
did not get through the 2.6.30 merge window, and it's not looking very good
for 2.6.31 either.
Dom0, remember, is the privileged control domain of a Xen system; it's the
One Ring which binds all the others. Unlike the DomU support (used for
ordinary guests), Dom0 remains outside of the mainline kernel. So anybody
who ships it must patch it in separately; for a patch as large and
intrusive as Dom0, that is not a pleasant task. It is a necessary one,
though; Xen has a lot of users. As expressed by Xen hacker Jeremy Fitzhardinge:
Xen is very widely used. There are at least 500k servers running
Xen in commercial user sites (and untold numbers of smaller sites
and personal users), running millions of virtual guest domains.
If you browse the net at all widely, you're likely to be using a
Xen-based server; all of Amazon runs on Xen, for example. Mozilla
and Debian are hosted on Xen systems.
Xen developers and users would all like to see that code merged into the
mainline. A number of otherwise uninvolved kernel developers have also
argued in favor of merging this code. So one might well wonder why there
is still opposition.
One problem is a fundamental disagreement with the Xen design, which relies
on a separate hypervisor along with user-space control components. To some
developers, it looks like an unfortunate mishmash of code in the mainline
kernel, in Xen-specific kernel code, and in user space - with, of course, a
set-in-concrete user-space ABI in the middle. Many developers are more
comfortable with the fully in-kernel hypervisor approach taken by KVM.
Thomas Gleixner is especially worried about
the possible results of merging the Xen Dom0 code for this reason (among
others):
Aside of that it can also hinder the development of a properly
designed hypervisor in Linux: 'why bother with that new stuff, it
might be cleaner and nicer, but we have this Xen dom0 stuff already.'
Steven Rostedt, who has worked on Xen in the past, also dislikes the hypervisor design and the
effects it has on kernel development:
The major difference between KVM and Xen is that KVM _is_ part of
Linux. Xen is not. The reason that this matters is that if we need
to make a change to the way Linux works we can simply make KVM
handle the change. That is, you could think of it as Dom0 and the
hypervisor would always be in sync.
If we were to break an interface with Dom0 for Xen then we would
have a bunch of people crying foul about us breaking a defined
API. One of Thomas's complaints (and a valid one) is that once
Linux supports an external API it must always keep it
compatible. This will hamper new development in Linux if the APIs
are scattered throughout the kernel without much thought.
Steven suggests merging the Xen hypervisor into the mainline so that it's
all part of Linux, making the hypervisor ABI an internal, changeable
interface. Some other developers - generally those most hostile to merging
Dom0 in its current form - supported this idea.
It's certainly not the first time
that this sort of idea has been raised. But, despite many calls to bring some of the
"plumbing layer" into the kernel proper, that has yet to happen; it seems
unlikely that something as large as Xen would be the first user-space
component to break through
that barrier - even if the Xen developers were amenable to that approach.
The hypervisor design would probably not be an insurmountable obstacle to
merging by itself. But there are other complaints. The maintainers of the
x86 architecture dislike the changes made to their code by the Dom0
patches. By their reckoning, there are far too many
"if (xen)..." conditionals and too many #ifdefs.
They would very much like to see the Xen code cleaned up and made less
intrusive into the core x86 code. Linus supports them on this point:
The fact is (and this is a _fact_): Xen is a total mess from a
development standpoint. I talked about this in private with
Jeremy. Xen pollutes the architecture code in ways that NO OTHER
subsystem does. And I have never EVER seen the Xen developers
really acknowledge that and try to fix it.
The Xen cause was also not helped by some
performance numbers posted by Ingo Molnar. If you choose the right
benchmark, it seems, you can show that the paravirt_ops layer imposes a 1%
overhead on kernel performance. Paravirt_ops is the code which abstracts
low-level machine operations; it can enable the same kernel to run either
on "bare metal" or virtualized under a hypervisor. It adds a layer of
indirect function calls where, before, inline code was used.
Those function calls come at a cost which has now been quantified by Ingo
(but one should note that Rusty Russell has shown that, with the right benchmark, a number
of other common configuration options have a much higher cost).
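The source of that overhead is easy to see in a simplified sketch; this is
illustrative code, not the kernel's actual paravirt implementation (which
can also patch call sites at boot time):

    struct pv_irq_ops {
        void (*irq_disable)(void);
    };

    static void native_irq_disable(void)
    {
        asm volatile("cli");        /* the old inline code path */
    }

    static struct pv_irq_ops pv_irq_ops = {
        /* a hypervisor would install its own operation here */
        .irq_disable = native_irq_disable,
    };

    static inline void raw_local_irq_disable(void)
    {
        /* an indirect call replaces what was once inline code */
        pv_irq_ops.irq_disable();
    }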
The problem here is not that Xen users have a slower kernel; the real issue
is that any kernel which might ever be run under Xen must be built with
paravirt_ops enabled. There are few things which make distributors' lives
more miserable than forcing them to build, ship, and support another kernel
configuration. So most distributor kernels run with paravirt_ops enabled;
that means that all users, regardless of whether they have any interest in
Xen, pay the price. In some cases, that cost is too high; Nick Piggin said:
FWIW, we had to disable paravirt in our default SLES11 kernel.
(admittedly this was before some of the recent improvements were
made). But there are only so many 1% performance regressions you
can introduce before customers won't upgrade (or vendors won't
publish benchmarks with the new software).
Ingo is strongly critical of the perceived cost of paravirt_ops, but he also proposes a solution:
Note what _is_ acceptable and what _is_ doable is to be a bit more
inventive when dumping this optional, currently-high-overhead
paravirt feature on us. My message to Xen folks is: use dynamic
patching, fix your hypervisor and just use plain old-fashioned
_restraint_ and common sense when engineering things, and for
heaven's sake, _care_ about the native kernel's performance because
in the long run it's your bread and butter too.
He goes on to say that merging Dom0 now would only make things worse; it
would give the Xen developers less incentive to fix the problems while,
simultaneously, making it harder for distributors to disable paravirt_ops
in their kernels.
And that, perhaps, leads to the fundamental disconnect in this discussion.
There are two distinctive lines of thought with regard to when code with
known problems should be merged:
- Some developers point out that code which is in the mainline benefits
from the attention of a much wider pool of developers and improves
much more quickly. It is easy to find examples of code which, after
languishing for years out of the mainline, improved quickly after
being merged. This is the reasoning behind the -staging tree and the
general policy toward merging drivers sooner rather than later.
- Some developers - sometimes, amusingly, the same developers - say, instead, that the
best time to get fundamental problems fixed is before merging. This
is undoubtedly true for user-space ABI issues; those often cannot be
fixed at all after they have been shipped in a stable kernel. But
holding code out of the mainline is also a powerful lever which
subsystem maintainers can employ to motivate developers to fix
problems. Once the code is merged, that particular tool is no longer
available.
Both of these themes run through the Xen discussion. There is no doubt
that the Xen Dom0 code would see more eyeballs - and patches - after being
merged. So some developers think that the right thing to do is to merge
this much-requested feature, then fix it up afterward. Chris Mason put it this way:
The idea that we should take code that is heavily used is
important. The best place to fix xen is in the kernel. It always
has been, and keeping it out is just making it harder on everyone.
But the stronger voice looks to be the one saying that the problems need to
be fixed first. The deciding factors seem to be (1) the user-space
ABI, and (2) the intrusion into the core x86 code; those issues make
Xen different from yet another driver or filesystem. That, in turn,
suggests that the Dom0 code is not destined for the mainline anytime soon.
Instead, the Xen developers will be expected to go back and fix a list of
problems - a lot of work with an uncertain result at the end.
Last week's Security page looked at some
recently proposed patches that would "sanitize" kernel memory by clearing
it as it was freed. At that time, a second version of the patches, which
cleared memory on every free—dependent only on the sanitize_mem
boot parameter, rather than on a per-page flag—was generally well received.
But perhaps folks just had not yet had a chance to look. Over the
last week, multiple objections have been raised, which were mostly met with
belligerent responses from developer Larry Highsmith. In many ways, this
is starting to look like yet another lesson in "how not to work with the
The basic problem is that data can persist in memory long after that memory
is freed. Sometimes that data contains passwords, cryptographic keys,
confidential documents, etc., but it is impossible for the kernel to know,
in the general case, which pages are sensitive. By clearing memory when it
is deallocated, the lifetime of this potentially sensitive data can be
reduced. A research paper
describes some experiments that showed memory values persisting for days
and even weeks on Linux systems. A bug in the kernel that leaked memory
information could potentially leak these values to attackers.
So, Highsmith proposed adding a memory sanitization feature that has long
been a part of the patches applied to the kernel by the PaX security project. There is
clearly a performance impact to clearing memory as it is reclaimed, but,
since memory is cleared as it is allocated (to avoid obvious information leaks),
the impact may not be as large as it seems at first glance. As Arjan van
de Ven points out:
.. and if we zero on free, we don't need to zero on allocate.
While this is a little controversial, it does mean that at least part of
the cost is just time-shifted, which means it'll not be TOO bad
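Arjan's point can be seen in a toy allocator; this is plain user-space C
with invented names, not the kernel's real page allocator:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #define TOY_PAGE_SIZE 4096

    struct toy_page {
        struct toy_page *next;
        bool zeroed;
        unsigned char data[TOY_PAGE_SIZE];
    };

    static struct toy_page *free_list;

    static void toy_free(struct toy_page *page)
    {
        memset(page->data, 0, TOY_PAGE_SIZE);  /* sanitize: cost paid here */
        page->zeroed = true;
        page->next = free_list;
        free_list = page;
    }

    static struct toy_page *toy_alloc_zeroed(void)
    {
        struct toy_page *page = free_list;

        if (!page)
            return calloc(1, sizeof(*page));   /* fresh memory is zeroed */
        free_list = page->next;
        if (!page->zeroed)                     /* normally prepaid at free */
            memset(page->data, 0, TOY_PAGE_SIZE);
        page->zeroed = false;
        return page;
    }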
Peter Zijlstra is concerned about the cache
effects: "zero on allocate has the advantage of cache hotness, we're
going to use the memory, why else allocate it. [...] zero on free only
causes extra cache evictions for no gain." But van de Ven describes how he sees the caches being
affected, concluding: "Don't get me wrong, I'm not arguing that
zero-on-free is better, I'm
just trying to point out that the 'advantage' of zero-on-allocate isn't
nearly as big as people sometimes think it is..."
But some, like Alan Cox, think the
performance impact is immaterial: "If you need this kind of data wiping then the performance hit
is basically irrelevant, the security comes first." Zijlstra
and others are concerned about the price that is paid by all kernel
users, even those who have not enabled sanitize_mem. He notes that the patches would add extra
function calls and branches even when the feature is not enabled.
Suggestions were made to benchmark the proposed code against the existing
implementation, but that is where the conversation started to go off the rails.
Highsmith obviously gets frustrated with the direction of the
discussion, but rather than stepping back, he lashes out. There is
certainly some provocation in the thread; Zijlstra's "Really, get a life, go fix real bugs. Don't make our kernel slower for
wanking rights." comment certainly
didn't help. But Highsmith needs to recognize that he is the one trying to
get something added to the kernel, so the burden of "proof" is on him.
Instead, his condescending manner seems to indicate that he feels like he
is presenting the kernel community with a gift—one they are too
slow-witted to understand.
An important characteristic for kernel contributors is that they work well
with the rest of the community: answer questions, respond to code review
suggestions, etc. When that doesn't happen, patches tend to be ignored,
regardless of their technical merit, and Highsmith seems headed down that
path. When it was suggested that kzfree() be used on specific
kernel allocations for sensitive data—which would clear the memory,
then free it—Highsmith responded:
That's hopeless, and kzfree is broken. Like I said in my earlier reply,
please test that yourself to see the results. Whoever wrote that ignored
how SLAB/SLUB work and if kzfree had been used somewhere in the kernel
before, it should have been noticed [a] long time ago.
Since Highsmith was responding to SLAB maintainer Pekka Enberg's
suggestion, that response—even if true—probably wasn't the right
approach. Enberg and others asked specifically about the problems in
kzfree(), but the response from
Highsmith was a combination of condescension and vagueness. As soon as
Enberg and Ingo Molnar tried to pin down where those problems are, Highsmith
went off on a rant about the SLOB memory allocator.
In addition, Molnar has pointed out that
some of the same sensitive values can have long lifetimes on the kernel
stack:
Long-lived tasks that touched any crypto path (or other sensitive
data in the kernel) and leaked it to the kernel stack can possibly
keep sensitive information there indefinitely (especially if that
information got there in an accidentally deep stack context) - up
until the task exits. That information will outlive the freeing and
sanitizing of the original sensitive data.
Rather than recognize this as an additional area that needs addressing,
Highsmith just continues his tirade:
But you and the other cabal of vagueness have only sent mostly useless
comments, outright uncivil responses, obvious misdirection attempts,
unfounded critics, etc. I haven't seen more fallacies put together since
the last time I read an unreleased film script by Jerry Lewis.
Overall, the idea of clearing memory as it is freed, based on a boot-time
flag is reasonable. Several kernel hackers, including Cox and Rik van
Riel, have expressed interest in seeing the feature added. With some
effort, it would seem that the performance cost for the disabled case could
be reduced to an acceptable level, but if the main proponent is spending
his time fighting and flaming, it seems unlikely that it will ever get
merged.
A newer set of patches, which just uses kzfree() in specific sensitive
places (tty buffer management, 802.11 key handling, and the crypto API),
was also proposed by Highsmith, but Linus Torvalds was not particularly
impressed: in his view, there was no need to use kzfree() there; a simple
memset() was sufficient.
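The distinction is easy to sketch in plain C; these helpers are
illustrative only (the kernel's kzfree() takes just the pointer and
derives the allocation size itself):

    #include <stdlib.h>
    #include <string.h>

    /* kzfree()-style helper: clear an entire allocation, then free it.
     * A production version must also ensure that the clear cannot be
     * optimized away by the compiler. */
    static void zero_and_free(void *p, size_t size)
    {
        if (p == NULL)
            return;
        memset(p, 0, size);
        free(p);
    }

    /* Where only part of an object is sensitive, or the object lives
     * on, a targeted memset() of the secret suffices: */
    struct session {
        unsigned char key[32];   /* sensitive material */
        unsigned long stats;     /* nothing secret here */
    };

    static void drop_key(struct session *s)
    {
        memset(s->key, 0, sizeof(s->key));
    }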
Torvalds was not necessarily a believer in the need for the patches, nor
was he impressed by how Highsmith responded to review:
but quite frankly, I'm not
convinced about these patches at all.
I'm also not in the least convinced about how you just dismiss everybodys
concerns.
There were some additional technical complaints about the patches as
well, particularly the use of kzfree() everywhere in the crypto
API patch. Crypto API maintainer Herbert Xu noted: "The zeroing of metadata is
gratuitous." Overall, they had the look of being created
grudgingly—as if it were a favor to do so.
Where things go from here is unclear. Highsmith seemed to possibly be
signing off in his reply to Torvalds:
"The next time a kernel vulnerability appears that is remotely
some of the venues of attack I've commented, it will be useful to be
able to refer to these responses." There is some justification for
Highsmith's frustration, but he needs to see that it isn't going to do him
(or the kernel) any good.
Kernel contributors, especially new ones, need to recognize that the community
has folks that are at least as smart as they are. In this case, some of
those developers may not have the security focus that Highsmith does, but
that doesn't reduce their understanding of the kernel, nor their interest
in seeing security-enhancing patches applied. It would be
unfortunate to see this feature, which could be very useful in some
environments, fall by the wayside.