Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.30-rc8, released on June 2. It is probably the last prepatch before the final 2.6.30 release. "A lot of small stuff, fixing a few regressions (and at least one bugzilla entry going back to 2.6.24). The small stuff does matter. Please test." Full details can be found in the long-format changelog.
There have been no stable releases over the last week; the last stable update was 2.6.29.4 on May 19.
Kernel development news
Quotes of the week
In brief
Retrying core dump writes. Paul Smith posted a patch that would retry short or interrupted writes while dumping core, thus preventing the creation of an incomplete core dump when a signal arrives. Alan Cox NAK-ed the patch, noting: "The existing behaviour is an absolute godsend when you've something like a core dump stuck on an NFS mount or something trying to core dump to very slow media." But the idea did lead to some interesting discussion of which signals should cause a core dump to be interrupted (thus leaving a short core file) and which should be ignored.
There is an inherent difference between an interactive program dumping core, which a user might wish to interrupt with SIGINT, and a non-interactive process whose core dump the user or developer might wish to see completed.
Smith describes one scenario: "a worker process might appear unresponsive due to a core being dumped and the parent would decide to shoot it with SIGINT based on various timeouts etc." No decision was made, but Roland McGrath analyzed four signal categories and noted that at least two of the categories needed to be addressed as they are mishandled by the current code.
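The retry behaviour under discussion can be illustrated with a user-space analogue. This is a minimal sketch, not Smith's patch: write_fully() is a hypothetical helper that keeps calling write() until the whole buffer is out, resuming after short writes and after EINTR.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the posted patch): write all of
 * buf to fd, retrying short writes and writes interrupted by a signal.
 * Returns 0 on success, -1 on a genuine error with errno set.
 */
int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data moved: retry */
            return -1;      /* real I/O error: give up */
        }
        p += n;             /* short write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

A loop like this retries unconditionally, which is the property Cox objected to: nothing short of an I/O error stops it. Any real solution would need a policy for which signals are allowed to break out of the loop.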
Device tree. The Open Firmware "device tree" is a description of a system's hardware configuration in a standardized data structure. Some platforms have used device trees to separate the description of the hardware from the kernel running on that hardware; that, in turn, allows one kernel to support a wider variety of systems. Janboe Ye recently proposed adding device tree support to the ARM architecture, which arguably supports the widest variety of hardware of all. That has, in turn, led to a long discussion of how much device tree really helps, and how feasible it is to create a single kernel for all systems of a given architecture.
Developers of architectures using device tree seem to be happy with the results; see this 2008 OLS paper [PDF] for a description of how things went with the PowerPC architecture. Maintainers of other architectures are less convinced, though. ARM maintainer Russell King worries that device tree could turn out to be an expensive dead end; he would like to see a subset of ARM architectures converted first to find out whether it is likely to work well or not. An incremental approach probably makes sense in general, so that's how things are likely to go.
The "host protected area" is an IDE concept which allows a controller to hide a portion of a drive from the operating system's view. When HPA was introduced years ago, its primary use was to make large drives (by the standards of the day) appear small so that certain legacy operating systems would not be confused. Linux, naturally, never had any such problem, so the Linux IDE layer would traditionally disable the HPA during the probing process. That was the right thing to do at the time; it allowed Linux systems to make use of the entire drive.
It has been a while since operating systems required protection from the shock of seeing an overly-large drive. But the HPA remains for different reasons. Vendors will use the HPA to stash RAID information, for example. Windows systems often come with a full "reinstall this system from the beginning" recovery image - apparently a useful feature on that platform. Rootkits sometimes hide information there. And so on. In all cases but the last, it is probably a mistake for the operating system to overwrite the HPA on contemporary systems. So turning off HPA protection by default is no longer the right thing to do.
The libata driver subsystem has observed the HPA since the beginning, but the IDE code retains its old default. That could change, though, with a patch set posted by IDE maintainer Bartlomiej Zolnierkiewicz. These patches will cause the IDE layer to preserve the HPA by default - unless the drive has partitions which cover the HPA already. That test should be enough to ensure that older systems continue to function while avoiding trashing the HPA on newer drives. For systems not properly covered by this change, the nohpa module parameter can be used to control HPA behavior directly.
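The "preserve unless already covered" test can be sketched as follows. This is only an illustration of the rule as described here, not code from the patch set; struct demo_partition and demo_should_disable_hpa() are hypothetical names, and sizes are expressed in sectors.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical partition descriptor: first sector and length in sectors. */
struct demo_partition {
    uint64_t start;
    uint64_t nr_sectors;
};

/*
 * Decide whether the HPA should be disabled. visible_sectors is the
 * capacity the drive reports with the HPA in force; native_sectors is
 * the full capacity. The HPA is left alone unless some existing
 * partition already extends into the hidden region.
 */
bool demo_should_disable_hpa(uint64_t visible_sectors,
                             uint64_t native_sectors,
                             const struct demo_partition *parts,
                             size_t nr_parts)
{
    if (native_sectors <= visible_sectors)
        return false;                   /* no HPA configured at all */

    for (size_t i = 0; i < nr_parts; i++) {
        uint64_t end = parts[i].start + parts[i].nr_sectors;

        if (end > visible_sectors)
            return true;                /* partition reaches into the HPA */
    }
    return false;                       /* preserve the HPA by default */
}
```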
reflink(). There's another reflink() proposal out there. This one simplifies the preserve argument slightly, replacing the set of flags with an all-or-none option for now. So reflink() can be used in the full snapshot mode (with suitable privilege) or in the reflink-as-copy mode, but with no options in between.
Control over process IDs. The proposed checkpoint/restart feature has a number of challenges to overcome. One of those is that processes can become very confused if their process ID changes suddenly. So restarting a checkpointed process requires that the process's old ID be restored as well. The use of PID namespaces can help to ensure that the requisite IDs are available, but there's no way in Linux to request that a process be started with a specific ID.
Sukadev Bhattiprolu has a proposal for a new system call to address this problem: clone_with_pids(). It would behave like ordinary clone(), but it takes an additional argument: an array of process IDs. The array contains one desired process ID for each namespace in the current hierarchy, with the first being the global namespace. A deeply-nested process can, thus, be created with a specific ID in each namespace where it will appear.
This patch has been "gently tested" and not posted outside of the containers list, so it has seen relatively little review thus far. Expect some changes if this code starts to get closer to the mainline.
How many page flags do we really have?
The recently-discussed kernel memory sanitization patch was criticized on a number of points, one of which was its use of a dedicated page flag. Andi Kleen's HWPOISON patch (enabling upcoming Intel CPU features for dealing with memory errors) has run into trouble on similar grounds. The desperate shortage of page flags has been an article of faith among kernel developers for years. But, interestingly, not everybody agrees that a problem exists, and almost nobody can answer the simple question of how many flags are available in the first place. So a look at the Linux page flags issue seems in order.
"Page flags" are simple bit flags describing the state of a page of physical memory. They are defined in <linux/page-flags.h>. Flags exist to mark "reserved" pages (kernel memory, I/O memory, or simply nonexistent), locked pages, those under writeback I/O, those which are part of a compound page, pages managed by the slab allocator, and more. Depending on the target architecture and kernel configuration options selected, there can be as many as 24 individual flags defined.
These flags live in the flags field of struct page. This field is declared to be an unsigned long, so one might think that figuring out how much space is left for new flags would be a straightforward task. To a casual observer, it would look like, on a 32-bit system, 24 flags have been used, leaving eight available:
In other words, the situation is starting to get tight, but it is not a crisis quite yet.
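The casual observer's arithmetic can be written down directly. The sketch below is illustrative only: the DEMO_ names and bit positions are invented for the example (the real list lives in <linux/page-flags.h> and varies with configuration), and a uint32_t stands in for unsigned long on a 32-bit system.

```c
#include <limits.h>
#include <stdint.h>

/* Illustrative flag numbers only; not the kernel's actual ordering. */
enum demo_pageflag {
    DEMO_PG_locked,
    DEMO_PG_reserved,
    DEMO_PG_writeback,
    DEMO_PG_slab,
    DEMO_NR_PAGEFLAGS = 24      /* worst case cited in the text */
};

/* Stand-in for the "unsigned long flags" field on a 32-bit system. */
typedef uint32_t demo_flags_t;

static inline void demo_set_flag(demo_flags_t *f, enum demo_pageflag bit)
{
    *f |= (demo_flags_t)1 << bit;
}

static inline void demo_clear_flag(demo_flags_t *f, enum demo_pageflag bit)
{
    *f &= ~((demo_flags_t)1 << bit);
}

static inline int demo_test_flag(demo_flags_t f, enum demo_pageflag bit)
{
    return (f >> bit) & 1;
}

/* Naive count of unused bits: width of the word minus defined flags. */
static inline int demo_free_bits(void)
{
    return (int)(sizeof(demo_flags_t) * CHAR_BIT) - DEMO_NR_PAGEFLAGS;
}
```

With these assumptions demo_free_bits() evaluates to 32 - 24 = 8, the "eight available" figure of the naive view.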
But little is straightforward when it comes to struct page. One of these structures exists for every physical page in the system; on a 4GB system, there will be one million page structures. Given that every byte added to struct page is amplified a million times, it is not surprising that there is a strong motivation to avoid growing this structure at any cost. So struct page contains no less than three unions and is surrounded by complicated rules describing which fields are valid at which times. Changes to how this structure is accessed must be made with great care.
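The amplification effect is simple arithmetic. The sketch below assumes a 4KB page size, which the "one million" figure implies but the text does not state; the function names are invented for the example.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL   /* assumed 4KB pages */

/* Number of page structures needed to describe mem_bytes of RAM. */
uint64_t demo_nr_pages(uint64_t mem_bytes)
{
    return mem_bytes / DEMO_PAGE_SIZE;
}

/* Total memory consumed by growing struct page by growth bytes. */
uint64_t demo_growth_cost(uint64_t mem_bytes, uint64_t growth)
{
    return demo_nr_pages(mem_bytes) * growth;
}
```

Under that assumption, a 4GB system has 1,048,576 page structures, so growing struct page by a single four-byte word would consume 4MB of memory.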
Unions are not the only technique used to shoehorn as much information as possible into this small structure. Non-uniform memory access (NUMA) systems need to track information on which node each page belongs to, and which zone within the node as well. Rather than add fields to struct page, the NUMA hackers grabbed the free bits at the top of the flags field, yielding something like this:
So, on a 32-bit system with 24 page flags defined (a pessimistic scenario), there are eight bits available for the node and zone information, practically limiting 32-bit NUMA systems to 64 nodes, which is almost certainly adequate. But the addition of more page flags would come at the cost of supporting fewer NUMA nodes, and that would be unwelcome.
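The trade-off can be made concrete with a small sketch. The layout here is an assumption chosen to match the numbers in the text (24 flag bits, then two zone bits and six node bits in a 32-bit word); the kernel's actual field widths and positions depend on configuration, and the DEMO_ names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FLAG_BITS   24u
#define DEMO_ZONE_BITS   2u     /* assumed: room for four zones */
#define DEMO_NODE_BITS   6u     /* what is left: 64 nodes */
#define DEMO_ZONE_SHIFT  DEMO_FLAG_BITS
#define DEMO_NODE_SHIFT  (DEMO_ZONE_SHIFT + DEMO_ZONE_BITS)
#define DEMO_ZONE_MASK   ((1u << DEMO_ZONE_BITS) - 1)
#define DEMO_NODE_MASK   ((1u << DEMO_NODE_BITS) - 1)

/* Pack zone and node numbers into the bits above the page flags. */
static inline uint32_t demo_pack(uint32_t flags, uint32_t zone, uint32_t node)
{
    assert((flags >> DEMO_FLAG_BITS) == 0);
    assert(zone <= DEMO_ZONE_MASK && node <= DEMO_NODE_MASK);
    return flags | (zone << DEMO_ZONE_SHIFT) | (node << DEMO_NODE_SHIFT);
}

static inline uint32_t demo_zone(uint32_t word)
{
    return (word >> DEMO_ZONE_SHIFT) & DEMO_ZONE_MASK;
}

static inline uint32_t demo_node(uint32_t word)
{
    return (word >> DEMO_NODE_SHIFT) & DEMO_NODE_MASK;
}

/* Nodes addressable in a 32-bit word after flags and zone bits are taken. */
static inline uint32_t demo_max_nodes(uint32_t nr_flags, uint32_t zone_bits)
{
    uint32_t used = nr_flags + zone_bits;

    return used > 32 ? 0 : (used == 32 ? 1 : 1u << (32 - used));
}
```

With this layout, demo_max_nodes(24, 2) gives 64, while adding a single flag, demo_max_nodes(25, 2), gives 32: each new page flag halves the number of nodes that fit.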
Things get worse on systems with complicated physical memory layouts. On such systems, memory is not organized into a single, continuous range of physical addresses; instead, it is spread out with holes in the middle. Memory management on these "sparse memory" systems requires that each page have a "section" number associated with it. That section number is stored - you guessed it - in the spare bits at the top of the flags field. If space gets too tight, the kernel will move the node number into a separate array, slowing things down in the process. Either way, it seems clear that there is not a whole lot of spare room in the flags field on these systems.
So the real answer to "how many page flags are free?" is, for all practical purposes, "zero," at least on 32-bit NUMA systems. Making room for more would require expanding struct page, which is a heavy cost to pay. Developers should, thus, not be surprised when proposals to use new page flags run into stiff opposition. It's only one bit, but that bit is in the middle of some of the most sought-after real estate in the entire kernel.
In the case of Andi's HWPOISON patch, this opposition has come in the form of a number of alternative suggestions. One was to simply use the "reserved" bit, but that could lead to difficulties in parts of the code where that usage is not expected. Then it was suggested that the combination of the "reserved" and "writeback" flags could indicate a poisoned page, but Andi claims that this approach cannot work. Andrew Morton has suggested that HWPOISON could be made into a 64-bit-only feature; Andi allows as to how that might be possible, but he clearly doesn't like the idea.
Instead, Andi takes the position that the page flag shortage does not really exist. It's not a problem at all on 64-bit systems, where unsigned long is twice as wide. The number of 32-bit systems with a large number of NUMA nodes is small and shrinking; it's not something that the developers need be concerned about. And, says Andi, if things get really bad, the sparse memory section number can be moved into a separate array like the NUMA node number. Given this view of the problem, holding back a useful new feature over concerns about a single page flag bit seems misplaced.
Nobody has challenged Andi's view that the problem is not as severe as most people think, though Andrew Morton has hinted that Andi should go ahead and prove his ideas about moving the section number out of the page structure. That might not be a bad idea. Even if page flags are a little more abundant than most developers think, it still is not hard to foresee a time when they are exhausted, at least on 32-bit systems. Proposals involving new page flags are not particularly rare; unless we want to restrict features needing page flags to 64-bit systems, we'll need to make some more flags available before too long.
Xen again
Your editor is widely known for his invariably correct and infallible predictions. So, certainly, he would never have said something like this:
OK, anybody needing any further evidence of your editor's ability to foresee the future need only look at his investment portfolio...or, shall we say, the smoldering remains thereof. Needless to say, Xen Dom0 support did not get through the 2.6.30 merge window, and it's not looking very good for 2.6.31 either.
Dom0, remember, is the hypervisor portion of the Xen system; it's the One Ring which binds all the others. Unlike the DomU support (used for ordinary guests), Dom0 remains outside of the mainline kernel. So anybody who ships it must patch it in separately; for a patch as large and intrusive as Dom0, that is not a pleasant task. It is a necessary one, though; Xen has a lot of users. As expressed by Xen hacker Jeremy Fitzhardinge:
Xen developers and users would all like to see that code merged into the mainline. A number of otherwise uninvolved kernel developers have also argued in favor of merging this code. So one might well wonder why there is still opposition.
One problem is a fundamental disagreement with the Xen design, which calls for a separate user-space hypervisor component. To some developers, it looks like an unfortunate mishmash of code in the mainline kernel, in Xen-specific kernel code, and in user space - with, of course, a set-in-concrete user-space ABI in the middle. Many developers are more comfortable with the fully in-kernel hypervisor approach taken by KVM. Thomas Gleixner is especially worried about the possible results of merging the Xen Dom0 code for this reason (among several others):
Steven Rostedt, who has worked on Xen in the past, also dislikes the hypervisor design and the effects it has on kernel development:
If we were to break an interface with Dom0 for Xen then we would have a bunch of people crying foul about us breaking a defined API. One of Thomas's complaints (and a valid one) is that once Linux supports an external API it must always keep it compatible. This will hamper new development in Linux if the APIs are scattered throughout the kernel without much thought.
Steven suggests merging the Xen hypervisor into the mainline so that it's all part of Linux, and to make the hypervisor ABI an internal, changeable interface. Some other developers - generally those most hostile to merging Dom0 in its current form - supported this idea. It's certainly not the first time that this sort of idea has been raised. But, despite many calls to bring some of the "plumbing layer" into the kernel proper, that has yet to happen; it seems unlikely that something as large as Xen would be the first user-space component to break through that barrier - even if the Xen developers were amenable to that approach.
The hypervisor design would probably not be an insurmountable obstacle to merging by itself. But there are other complaints. The maintainers of the x86 architecture dislike the changes made to their code by the Dom0 patches. By their reckoning, there are far too many "if (xen)..." conditionals and too many #ifdefs. They would very much like to see the Xen code cleaned up and made less intrusive into the core x86 code. Linus supports them on this point:
The Xen cause was also not helped by some performance numbers posted by Ingo Molnar. If you choose the right benchmark, it seems, you can show that the paravirt_ops layer imposes a 1% overhead on kernel performance. Paravirt_ops is the code which abstracts low-level machine operations; it can enable the same kernel to run either on "bare metal" or virtualized under a hypervisor. It adds a layer of indirect function calls where, before, inline code was used. Those function calls come at a cost which has now been quantified by Ingo (but one should note that Rusty Russell has shown that, with the right benchmark, a number of other common configuration options have a much higher cost).
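The source of the overhead is easier to see in miniature. The following is a sketch of the general technique (a table of function pointers selected at boot), not the kernel's paravirt_ops code; every name in it is invented for illustration.

```c
/* Hypothetical stand-ins for machine state; not real kernel symbols. */
static int demo_irqs_enabled = 1;
static unsigned long demo_hypercalls;

/* Bare-metal version: in a non-paravirt kernel this would be inlined. */
static void demo_native_irq_disable(void)
{
    demo_irqs_enabled = 0;
}

/* Virtualized version: the operation is routed through the hypervisor. */
static void demo_hv_irq_disable(void)
{
    demo_hypercalls++;
    demo_irqs_enabled = 0;
}

/* The abstraction layer: one function pointer per low-level operation. */
struct demo_pv_ops {
    void (*irq_disable)(void);
};

static struct demo_pv_ops demo_ops = {
    .irq_disable = demo_native_irq_disable,
};

/* Chosen once at boot, depending on where the kernel finds itself. */
void demo_select_platform(int under_hypervisor)
{
    demo_ops.irq_disable = under_hypervisor ? demo_hv_irq_disable
                                            : demo_native_irq_disable;
}

/* Every caller now pays for an indirect call, even on bare metal. */
void demo_irq_disable(void)
{
    demo_ops.irq_disable();
}
```

The same binary works in both environments, but the indirect call in demo_irq_disable() is executed whether or not a hypervisor is present.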
The problem here is not that Xen users have a slower kernel; the real issue is that any kernel which might ever be run under Xen must be built with paravirt_ops enabled. There are few things which make distributors' lives more miserable than forcing them to build, ship, and support another kernel configuration. So most distributor kernels run with paravirt_ops enabled; that means that all users, regardless of whether they have any interest in Xen, pay the price. In some cases, that cost is too high; Nick Piggin said:
Ingo is strongly critical of the perceived cost of paravirt_ops, but he also proposes a solution:
He goes on to say that merging Dom0 now would only make things worse; it would give the Xen developers less incentive to fix the problems while, simultaneously, making it harder for distributors to disable paravirt_ops in their kernels.
And that, perhaps, leads to the fundamental disconnect in this discussion. There are two distinctive lines of thought with regard to when code with known problems should be merged:
- Some developers point out that code which is in the mainline benefits from the attention of a much wider pool of developers and improves much more quickly. It is easy to find examples of code which, after languishing for years out of the mainline, improved quickly after being merged. This is the reasoning behind the -staging tree and the general policy toward merging drivers sooner rather than later.
- Some developers - sometimes, amusingly, the same developers - say, instead, that the best time to get fundamental problems fixed is before merging. This is undoubtedly true for user-space ABI issues; those often cannot be fixed at all after they have been shipped in a stable kernel. But holding code out of the mainline is also a powerful lever which subsystem maintainers can employ to motivate developers to fix problems. Once the code is merged, that particular tool is no longer available.
Both of these themes run through the Xen discussion. There is no doubt that the Xen Dom0 code would see more eyeballs - and patches - after being merged. So some developers think that the right thing to do is to merge this much-requested feature, then fix it up afterward. Chris Mason put it this way:
But the stronger voice looks to be the one saying that the problems need to be fixed first. The deciding factors seem to be (1) the user-space ABI, and (2) the intrusion into the core x86 code; those issues make Xen different from yet another driver or filesystem. That, in turn, suggests that the Dom0 code is not destined for the mainline anytime soon. Instead, the Xen developers will be expected to go back and fix a list of problems - a lot of work with an uncertain result at the end.
Page sanitization, part 2
Last week's Security page looked at some recently proposed patches that would "sanitize" kernel memory by clearing it as it was freed. At that time, a second version of the patches which unconditionally cleared memory when freed—dependent on the sanitize_mem boot parameter—was generally well received. But, perhaps folks just had not yet had a chance to look. Over the last week, multiple objections have been raised, which were mostly met with belligerent responses from developer Larry Highsmith. In many ways, this is starting to look like yet another lesson in "how not to work with the kernel community".
The basic problem is that data can persist in memory long after that memory is freed. Sometimes that data contains passwords, cryptographic keys, confidential documents, etc., but it is impossible for the kernel to know, in the general case, which pages are sensitive. By clearing memory when it is deallocated, the lifetime of this potentially sensitive data can be reduced. A research paper describes some experiments that showed memory values persisting for days and even weeks on Linux systems. A bug in the kernel that leaked memory information could potentially leak these values to attackers.
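A user-space analogue shows the shape of the idea. This is a sketch only, not the PaX or proposed kernel code: demo_sanitize_enabled is a hypothetical stand-in for a boot-time switch, and the caller must supply the allocation size because free() does not know it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a boot-time "sanitize on free" switch. */
static int demo_sanitize_enabled = 1;

/*
 * Zero len bytes at p. Writing through a volatile pointer keeps the
 * compiler from discarding the stores as dead just before a free().
 */
static void demo_wipe(void *p, size_t len)
{
    volatile unsigned char *vp = p;

    while (len--)
        *vp++ = 0;
}

/* Clear the buffer (if enabled), then release it. */
void demo_free_sanitized(void *p, size_t len)
{
    if (p == NULL)
        return;
    if (demo_sanitize_enabled)  /* this test runs even when disabled */
        demo_wipe(p, len);
    free(p);
}
```

Even in this toy version, the check of demo_sanitize_enabled is executed on every free whether or not the feature is turned on.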
So, Highsmith proposed adding a memory sanitization feature that has long been a part of the patches applied to the kernel by the PaX security project. There is clearly a performance impact to clearing memory as it is reclaimed, but, since memory is cleared as it is allocated (to avoid obvious information leaks), the impact may not be as large as it seems at first glance. As Arjan van de Ven points out:
Peter Zijlstra is concerned about the cache effects: "zero on allocate has the advantage of cache hotness, we're going to use the memory, why else allocate it. [...] zero on free only causes extra cache evictions for no gain." But van de Ven describes how he sees the caches being affected, concluding: "Don't get me wrong, I'm not arguing that zero-on-free is better, I'm just trying to point out that the 'advantage' of zero-on-allocate isn't nearly as big as people sometimes think it is..."
But some, like Alan Cox, think the performance impact is immaterial: "If you need this kind of data wiping then the performance hit is basically irrelevant, the security comes first." Zijlstra and others are concerned about the price that is paid by all kernel users, even those who have not enabled sanitize_mem. He notes that the patches would add extra function calls and branches even when the feature is not enabled. Suggestions were made to benchmark the proposed code against the existing implementation, but that is where the conversation started to go off the rails.
Highsmith obviously gets frustrated with the direction of the discussion, but rather than stepping back, he lashes out. There is certainly some provocation in the thread; Zijlstra's "Really, get a life, go fix real bugs. Don't make our kernel slower for wanking rights." comment certainly didn't help. But Highsmith needs to recognize that he is the one trying to get something added to the kernel, so the burden of "proof" is on him. Instead, his condescending manner seems to indicate that he feels like he is presenting the kernel community with a gift - one they are too slow-witted to understand.
An important characteristic for kernel contributors is that they work well with the rest of the community: answer questions, respond to code review suggestions, etc. When that doesn't happen, patches tend to be ignored, regardless of their technical merit, and Highsmith seems headed down that path. When it was suggested that kzfree() (which clears the memory, then frees it) be used on specific kernel allocations holding sensitive data, Highsmith responded:
Since Highsmith was responding to SLAB maintainer Pekka Enberg's suggestion, that response—even if true—probably wasn't the right approach. Enberg and others asked specifically about the problems in kzfree(), but the response from Highsmith was a combination of condescension and vagueness. As soon as Enberg and Ingo Molnar tried to pin down where those problems are, Highsmith went off on a rant about the SLOB memory allocator.
In addition, Molnar has pointed out that some of the same sensitive values can have long lifetimes on the kernel stack:
Rather than recognize this as an additional area that needs addressing, Highsmith just continues his tirade:
Overall, the idea of clearing memory as it is freed based on a boot time flag is reasonable. Several kernel hackers, including Cox and Rik van Riel, have expressed interest in seeing the feature added. With some effort, it would seem that the performance cost for the disabled case could be reduced to an acceptable level, but if the main proponent is spending his time fighting and flaming, it seems unlikely that it will ever get merged.
A newer set of patches, which just uses kzfree() in specific sensitive places (tty buffer management, 802.11 key handling, and the crypto API), was also proposed by Highsmith, but Linus Torvalds was not particularly impressed. There was no need to use kzfree() there; a simple memset() was sufficient. Torvalds was not necessarily a believer in the need for the patches, nor was he impressed by how Highsmith responded to review:
There were some additional technical complaints about the patches as well, particularly the use of kzfree() everywhere in the crypto API patch. Crypto API maintainer Herbert Xu noted: "The zeroing of metadata is gratuitous." Overall, they had the look of being created grudgingly - as if it were a favor to do so.
Where things go from here is unclear. Highsmith seemed to possibly be signing off in his reply to Torvalds: "The next time a kernel vulnerability appears that is remotely related to some of the venues of attack I've commented, it will be useful to be able to refer to these responses." There is some justification for Highsmith's frustration, but he needs to see that it isn't going to do him (or the kernel) any good.
Kernel contributors, especially new ones, need to recognize that the community has folks that are at least as smart as they are. In this case, some of those developers may not have the security focus that Highsmith does, but that doesn't reduce their understanding of the kernel, nor their interest in seeing it have patches applied for better security. It would be unfortunate to see this feature, which could be very useful in some environments, fall by the wayside.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
