
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.30-rc8, released on June 2. It is probably the last prepatch before the final 2.6.30 release. "A lot of small stuff, fixing a few regressions (and at least one bugzilla entry going back to 2.6.24). The small stuff does matter. Please test." Full details can be found in the long-format changelog.

There have been no stable releases over the last week; the last stable update was 2.6.29.4 on May 19.


Kernel development news

Quotes of the week

fyi, the above discussion transitions akpm into the "confused" state. I'll keep the patch on hold until akpm transitions back out of that state.
-- Andrew "akpm" Morton

Because, when you think about it, there's really no merit in having consistently wrong code. A mix of right and wrong is better than 100% wrong.
-- Andrew Morton

The Dom0 push of Xen just seems too much like Linux being Xen's sex slave, when it should be the other way around.
-- Steven Rostedt


In brief

By Jonathan Corbet
June 3, 2009

Retrying core dump writes. Paul Smith posted a patch that would retry short or interrupted writes while dumping core, thus preventing the creation of an incomplete core dump when a signal arrives. Alan Cox NAK-ed the patch, noting: "The existing behaviour is an absolute godsend when you've something like a core dump stuck on an NFS mount or something trying to core dump to very slow media." But the idea did lead to some interesting discussion of which signals should cause a core dump to be interrupted, thus leaving a short core file, and which should be ignored.

There is an inherent difference between an interactive program dumping core, which a user might wish to interrupt with SIGINT, and a non-interactive process whose core dump the user or developer would want to see completed. Smith describes one scenario: "a worker process might appear unresponsive due to a core being dumped and the parent would decide to shoot it with SIGINT based on various timeouts etc." No decision was made, but Roland McGrath analyzed four signal categories and noted that at least two of the categories needed to be addressed as they are mishandled by the current code.
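The retry idea can be illustrated with a user-space analogue. This is only a sketch in Python, not the kernel patch: write_all() and its should_abort parameter are hypothetical names, with should_abort standing in for whatever policy decides that a signal ought to cut the dump short.

```python
import os


def write_all(fd, data, should_abort=lambda: False):
    """Write data to fd, retrying after short writes.

    should_abort is a hypothetical policy hook: when it returns True the
    loop stops early and a short file is left behind. Returns the number
    of bytes actually written.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        if should_abort():
            break
        # os.write() may accept fewer bytes than requested (a short
        # write), so keep going from where the last call stopped.
        # Since Python 3.5, os.write() itself retries on EINTR unless
        # the signal handler raises an exception.
        written += os.write(fd, view[written:])
    return written
```

A caller that never wants a truncated dump would leave should_abort at its default; one that shares Cox's concern about slow media would pass a predicate that gives up on, say, a pending fatal signal.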

Device tree. The Open Firmware "device tree" is a description of a system's hardware configuration in a standardized data structure. Some platforms have used device trees to separate the description of the hardware from the kernel running on that hardware; that, in turn, allows one kernel to support a wider variety of systems. Janboe Ye recently proposed adding device tree support to the ARM architecture, which arguably supports the widest variety of hardware of all. That has, in turn, led to a long discussion of how much device tree really helps, and how feasible it is to create a single kernel for all systems of a given architecture.

Developers of architectures using device tree seem to be happy with the results; see this 2008 OLS paper [PDF] for a description of how things went with the PowerPC architecture. Maintainers of other architectures are less convinced, though. ARM maintainer Russell King worries that device tree could turn out to be an expensive dead end; he would like to see a subset of ARM architectures converted first to find out whether it is likely to work well or not. An incremental approach probably makes sense in general, so that's how things are likely to go.
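As a rough illustration of how a device tree separates the hardware description from the code, consider the following Python sketch. Real device trees do carry "compatible" properties that drivers match against, but the board, node names, compatible strings, and driver table here are all made up for illustration.

```python
# Hypothetical board description; every name and address is invented.
BOARD_TREE = {
    "name": "/",
    "children": [
        {"name": "serial@1000", "compatible": "acme,uart", "reg": 0x1000},
        {"name": "ethernet@2000", "compatible": "acme,eth", "reg": 0x2000},
    ],
}

# Hypothetical driver table, keyed by the "compatible" string.
DRIVERS = {
    "acme,uart": lambda node: f"uart driver bound at {node['reg']:#x}",
    "acme,eth": lambda node: f"ethernet driver bound at {node['reg']:#x}",
}


def probe(node, drivers):
    """Walk the tree depth-first and bind a driver to each matching node."""
    bound = []
    driver = drivers.get(node.get("compatible"))
    if driver is not None:
        bound.append(driver(node))
    for child in node.get("children", []):
        bound.extend(probe(child, drivers))
    return bound
```

The same probe() code can be handed a different tree for a different board, which is the property that lets one kernel image cover many systems.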

Host protected area. The "host protected area" (HPA) is an IDE concept which allows a controller to hide a portion of a drive from the operating system's view. When HPA was introduced years ago, its primary use was to make large drives (by the standards of the day) appear small so that certain legacy operating systems would not be confused. Linux, naturally, never had any such problem, so the Linux IDE layer would traditionally disable the HPA during the probing process. That was the right thing to do at the time; it allowed Linux systems to make use of the entire drive.

It has been a while since operating systems required protection from the shock of seeing an overly-large drive. But the HPA remains for different reasons. Vendors will use the HPA to stash RAID information, for example. Windows systems often come with a full "reinstall this system from the beginning" recovery image - apparently a useful feature on that platform. Rootkits sometimes hide information there. And so on. In all cases but the last, it is probably a mistake for the operating system to overwrite the HPA on contemporary systems. So turning off HPA protection by default is no longer the right thing to do.

The libata driver subsystem has observed the HPA since the beginning, but the IDE code retains its old default. That could change, though, with a patch set posted by IDE maintainer Bartlomiej Zolnierkiewicz. These patches will cause the IDE layer to preserve the HPA by default - unless the drive has partitions which cover the HPA already. That test should be enough to ensure that older systems continue to function while avoiding trashing the HPA on newer drives. For systems not properly covered by this change, the nohpa module parameter can be used to control HPA behavior directly.

reflink(). There's another reflink() proposal out there. This one simplifies the preserve argument slightly, replacing the set of flags with an all-or-none option for now. So reflink() can be used in the full snapshot mode (with suitable privilege) or in the reflink-as-copy mode, but with no options in between.

Control over process IDs. The proposed checkpoint/restart feature has a number of challenges to overcome. One of those is that processes can become very confused if their process ID changes suddenly. So restarting a checkpointed process requires that the process's old ID be restored as well. The use of PID namespaces can help to ensure that the requisite IDs are available, but there's no way in Linux to request that a process be started with a specific ID.

Sukadev Bhattiprolu has a proposal for a new system call to address this problem: clone_with_pids(). It would behave like ordinary clone(), but it takes an additional argument: an array of process IDs. The array contains one desired process ID for each namespace in the current hierarchy, with the first being the global namespace. A deeply-nested process can, thus, be created with a specific ID in each namespace where it will appear.

This patch has been "gently tested" and not posted outside of the containers list, so it has seen relatively little review thus far. Expect some changes if this code starts to get closer to the mainline.


How many page flags do we really have?

By Jonathan Corbet
June 3, 2009

The recently-discussed kernel memory sanitization patch was criticized on a number of points, one of which was its use of a dedicated page flag. Andi Kleen's HWPOISON patch (enabling upcoming Intel CPU features for dealing with memory errors) has run into trouble on similar grounds. The desperate shortage of page flags has been an article of faith among kernel developers for years. But, interestingly, not everybody agrees that a problem exists, and almost nobody can answer the simple question of how many flags are available in the first place. So a look at the Linux page flags issue seems in order.

"Page flags" are simple bit flags describing the state of a page of physical memory. They are defined in <linux/page-flags.h>. Flags exist to mark "reserved" pages (kernel memory, I/O memory, or simply nonexistent), locked pages, those under writeback I/O, those which are part of a compound page, pages managed by the slab allocator, and more. Depending on the target architecture and kernel configuration options selected, there can be as many as 24 individual flags defined.

These flags live in the flags field of struct page. This field is declared to be an unsigned long, so one might think that figuring out how much space is left for new flags would be a straightforward task. To a casual observer, it would look like, on a 32-bit system, 24 flags have been used, leaving eight available:

[Figure: page flags]

In other words, the situation is starting to get tight, but it is not a crisis quite yet.

But little is straightforward when it comes to struct page. One of these structures exists for every physical page in the system; on a 4GB system, there will be one million page structures. Given that every byte added to struct page is amplified a million times, it is not surprising that there is a strong motivation to avoid growing this structure at any cost. So struct page contains no less than three unions and is surrounded by complicated rules describing which fields are valid at which times. Changes to how this structure is accessed must be made with great care.
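The "amplified a million times" arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes 4KB pages (the text does not state the page size, but the one-million figure implies it); page_struct_overhead() is a name invented here.

```python
def page_struct_overhead(mem_bytes, page_size=4096, extra_bytes_per_page=1):
    """Return (number of page structures, extra bytes consumed).

    page_size=4096 is an assumption; extra_bytes_per_page is how much
    struct page would grow by.
    """
    pages = mem_bytes // page_size
    return pages, pages * extra_bytes_per_page


pages, extra = page_struct_overhead(4 * 1024**3)
print(pages, extra)  # prints "1048576 1048576"
```

So each byte added to struct page costs about 1MB on a 4GB system, and a full extra 32-bit word would cost about 4MB.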

Unions are not the only technique used to shoehorn as much information as possible into this small structure. Non-uniform memory access (NUMA) systems need to track information on which node each page belongs to, and which zone within the node as well. Rather than add fields to struct page, the NUMA hackers grabbed the free bits at the top of the flags field, yielding something like this:

[Figure: page flags]

So, on a 32-bit system with 24 page flags defined (a pessimistic scenario), there are eight bits available for the node and zone information, practically limiting 32-bit NUMA systems to 64 nodes, which is almost certainly adequate. But the addition of more page flags would come at the cost of supporting fewer NUMA nodes, and that would be unwelcome.
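The trade-off between page flags and NUMA nodes can be made concrete with a small bit-packing sketch. The layout below is an assumption made for illustration: it reserves two bits for the zone (which is what makes eight spare bits work out to 64 nodes) and puts the node number at the very top of the word. The helper names are invented, and the real kernel macros differ.

```python
WORD_BITS = 32       # width of unsigned long on a 32-bit system
NR_PAGEFLAGS = 24    # the pessimistic flag count from the text
ZONE_BITS = 2        # assumption: room for four zones
NODE_BITS = WORD_BITS - NR_PAGEFLAGS - ZONE_BITS  # 6 bits, so 64 nodes

NODE_SHIFT = WORD_BITS - NODE_BITS   # node number sits at the top
ZONE_SHIFT = NODE_SHIFT - ZONE_BITS  # zone sits just below it


def pack(node, zone, flags):
    """Combine node, zone and flag bits into one 32-bit flags word."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node number does not fit")
    if not 0 <= zone < (1 << ZONE_BITS):
        raise ValueError("zone number does not fit")
    if not 0 <= flags < (1 << NR_PAGEFLAGS):
        raise ValueError("flag bits do not fit")
    return (node << NODE_SHIFT) | (zone << ZONE_SHIFT) | flags


def page_node(word):
    return word >> NODE_SHIFT


def page_zone(word):
    return (word >> ZONE_SHIFT) & ((1 << ZONE_BITS) - 1)


def page_flag_bits(word):
    return word & ((1 << NR_PAGEFLAGS) - 1)


def max_nodes(nr_flags):
    """Nodes that still fit once nr_flags bits are taken by page flags."""
    return 1 << (WORD_BITS - nr_flags - ZONE_BITS)
```

Under these assumptions max_nodes(24) is 64 and max_nodes(25) is 32: every additional page flag halves the number of nodes that can be encoded in the word.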

Things get worse on systems with complicated physical memory layouts. On such systems, memory is not organized into a single, continuous range of physical addresses; instead, it is spread out with holes in the middle. Memory management on these "sparse memory" systems requires that each page have a "section" number associated with it. That section number is stored - you guessed it - in the spare bits at the top of the flags field. If space gets too tight, the kernel will move the node number into a separate array, slowing things down in the process. Either way, it seems clear that there is not a whole lot of spare room in the flags field on these systems.

So the real answer to "how many page flags are free?" is, for all practical purposes, "zero," at least on 32-bit NUMA systems. Making room for more would require expanding struct page, which is a heavy cost to pay. Developers should, thus, not be surprised when proposals to use new page flags run into stiff opposition. It's only one bit, but that bit is in the middle of some of the most sought-after real estate in the entire kernel.

In the case of Andi's HWPOISON patch, this opposition has come in the form of a number of alternative suggestions. One was to simply use the "reserved" bit, but that could lead to difficulties in parts of the code where that usage is not expected. Then it was suggested that the combination of the "reserved" and "writeback" flags could indicate a poisoned page, but Andi claims that this approach cannot work. Andrew Morton has suggested that HWPOISON could be made into a 64-bit-only feature; Andi allows as to how that might be possible, but he clearly doesn't like the idea.

Instead, Andi takes the position that the page flag shortage does not really exist. It's not a problem at all on 64-bit systems, where unsigned long is twice as wide. The number of 32-bit systems with a large number of NUMA nodes is small and shrinking; it's not something that the developers need be concerned about. And, says Andi, if things get really bad, the sparse memory section number can be moved into a separate array like the NUMA node number. Given this view of the problem, blocking a useful new feature over concerns about a single page flag bit seems misplaced.

Nobody has challenged Andi's view that the problem is not as severe as most people think, though Andrew Morton has hinted that Andi should go ahead and prove his ideas about moving the section number out of the page structure. That might not be a bad idea. Even if page flags are a little more abundant than most developers think, it still is not hard to foresee a time when they are exhausted, at least on 32-bit systems. Proposals involving new page flags are not particularly rare; unless we want to restrict features needing page flags to 64-bit systems, we'll need to make some more flags available before too long.


Xen again

By Jonathan Corbet
June 3, 2009

Your editor is widely known for his invariably correct and infallible predictions. So, certainly, he would never have said something like this:

Mistakes may have been made in Xen's history, but it is a project which remains alive, and which has clear reasons to exist. Your editor predicts that the Dom0 code will find little opposition at the opening of the 2.6.30 merge window.

OK, anybody needing any further evidence of your editor's ability to foresee the future need only look at his investment portfolio...or, shall we say, the smoldering remains thereof. Needless to say, Xen Dom0 support did not get through the 2.6.30 merge window, and it's not looking very good for 2.6.31 either.

Dom0, remember, is the hypervisor portion of the Xen system; it's the One Ring which binds all the others. Unlike the DomU support (used for ordinary guests), Dom0 remains outside of the mainline kernel. So anybody who ships it must patch it in separately; for a patch as large and intrusive as Dom0, that is not a pleasant task. It is a necessary one, though; Xen has a lot of users. As expressed by Xen hacker Jeremy Fitzhardinge:

Xen is very widely used. There are at least 500k servers running Xen in commercial user sites (and untold numbers of smaller sites and personal users), running millions of virtual guest domains. If you browse the net at all widely, you're likely to be using a Xen-based server; all of Amazon runs on Xen, for example. Mozilla and Debian are hosted on Xen systems.

Xen developers and users would all like to see that code merged into the mainline. A number of otherwise uninvolved kernel developers have also argued in favor of merging this code. So one might well wonder why there is still opposition.

One problem is a fundamental disagreement with the Xen design, which calls for a separate user-space hypervisor component. To some developers, it looks like an unfortunate mishmash of code in the mainline kernel, in Xen-specific kernel code, and in user space - with, of course, a set-in-concrete user-space ABI in the middle. Many developers are more comfortable with the fully in-kernel hypervisor approach taken by KVM. Thomas Gleixner is especially worried about the possible results of merging the Xen Dom0 code for this reason (among several others):

Aside of that it can also hinder the development of a properly designed hypervisor in Linux: 'why bother with that new stuff, it might be cleaner and nicer, but we have this Xen dom0 stuff already?'.

Steven Rostedt, who has worked on Xen in the past, also dislikes the hypervisor design and the effects it has on kernel development:

The major difference between KVM and Xen is that KVM _is_ part of Linux. Xen is not. The reason that this matters is that if we need to make a change to the way Linux works we can simply make KVM handle the change. That is, you could think of it as Dom0 and the hypervisor would always be in sync.

If we were to break an interface with Dom0 for Xen then we would have a bunch of people crying foul about us breaking a defined API. One of Thomas's complaints (and a valid one) is that once Linux supports an external API it must always keep it compatible. This will hamper new development in Linux if the APIs are scattered throughout the kernel without much thought.

Steven suggests merging the Xen hypervisor into the mainline so that it's all part of Linux, and to make the hypervisor ABI an internal, changeable interface. Some other developers - generally those most hostile to merging Dom0 in its current form - supported this idea. It's certainly not the first time that this sort of idea has been raised. But, despite many calls to bring some of the "plumbing layer" into the kernel proper, that has yet to happen; it seems unlikely that something as large as Xen would be the first user-space component to break through that barrier - even if the Xen developers were amenable to that approach.

The hypervisor design would probably not be an insurmountable obstacle to merging by itself. But there are other complaints. The maintainers of the x86 architecture dislike the changes made to their code by the Dom0 patches. By their reckoning, there are far too many "if (xen)..." conditionals and too many #ifdefs. They would very much like to see the Xen code cleaned up and made less intrusive into the core x86 code. Linus supports them on this point:

The fact is (and this is a _fact_): Xen is a total mess from a development standpoint. I talked about this in private with Jeremy. Xen pollutes the architecture code in ways that NO OTHER subsystem does. And I have never EVER seen the Xen developers really acknowledge that and try to fix it.

The Xen cause was also not helped by some performance numbers posted by Ingo Molnar. If you choose the right benchmark, it seems, you can show that the paravirt_ops layer imposes a 1% overhead on kernel performance. Paravirt_ops is the code which abstracts low-level machine operations; it can enable the same kernel to run either on "bare metal" or virtualized under a hypervisor. It adds a layer of indirect function calls where, before, inline code was used. Those function calls come at a cost which has now been quantified by Ingo (but one should note that Rusty Russell has shown that, with the right benchmark, a number of other common configuration options have a much higher cost).

The problem here is not that Xen users have a slower kernel; the real issue is that any kernel which might ever be run under Xen must be built with paravirt_ops enabled. There are few things which make distributors' lives more miserable than forcing them to build, ship, and support another kernel configuration. So most distributor kernels run with paravirt_ops enabled; that means that all users, regardless of whether they have any interest in Xen, pay the price. In some cases, that cost is too high; Nick Piggin said:

FWIW, we had to disable paravirt in our default SLES11 kernel. (admittedly this was before some of the recent improvements were made). But there are only so many 1% performance regressions you can introduce before customers won't upgrade (or vendors won't publish benchmarks with the new software).

Ingo is strongly critical of the perceived cost of paravirt_ops, but he also proposes a solution:

Note what _is_ acceptable and what _is_ doable is to be a bit more inventive when dumping this optional, currently-high-overhead paravirt feature on us. My message to Xen folks is: use dynamic patching, fix your hypervisor and just use plain old-fashioned _restraint_ and common sense when engineering things, and for heaven's sake, _care_ about the native kernel's performance because in the long run it's your bread and butter too.

He goes on to say that merging Dom0 now would only make things worse; it would give the Xen developers less incentive to fix the problems while, simultaneously, making it harder for distributors to disable paravirt_ops in their kernels.

And that, perhaps, leads to the fundamental disconnect in this discussion. There are two distinctive lines of thought with regard to when code with known problems should be merged:

  • Some developers point out that code which is in the mainline benefits from the attention of a much wider pool of developers and improves much more quickly. It is easy to find examples of code which, after languishing for years out of the mainline, improved quickly after being merged. This is the reasoning behind the -staging tree and the general policy toward merging drivers sooner rather than later.

  • Some developers - sometimes, amusingly, the same developers - say, instead, that the best time to get fundamental problems fixed is before merging. This is undoubtedly true for user-space ABI issues; those often cannot be fixed at all after they have been shipped in a stable kernel. But holding code out of the mainline is also a powerful lever which subsystem maintainers can employ to motivate developers to fix problems. Once the code is merged, that particular tool is no longer available.

Both of these themes run through the Xen discussion. There is no doubt that the Xen Dom0 code would see more eyeballs - and patches - after being merged. So some developers think that the right thing to do is to merge this much-requested feature, then fix it up afterward. Chris Mason put it this way:

The idea that we should take code that is heavily used is important. The best place to fix xen is in the kernel. It always has been, and keeping it out is just making it harder on everyone involved.

But the stronger voice looks to be the one saying that the problems need to be fixed first. The deciding factors seem to be (1) the user-space ABI, and (2) the intrusion into the core x86 code; those issues make Xen different from yet another driver or filesystem. That, in turn, suggests that the Dom0 code is not destined for the mainline anytime soon. Instead, the Xen developers will be expected to go back and fix a list of problems - a lot of work with an uncertain result at the end.


Page sanitization, part 2

By Jake Edge
June 3, 2009

Last week's Security page looked at some recently proposed patches that would "sanitize" kernel memory by clearing it as it was freed. At that time, a second version of the patches, which cleared all memory when freed if the sanitize_mem boot parameter was set, was generally well received. But perhaps folks had just not yet had a chance to look. Over the last week, multiple objections have been raised, which were mostly met with belligerent responses from developer Larry Highsmith. In many ways, this is starting to look like yet another lesson in "how not to work with the kernel community".

The basic problem is that data can persist in memory long after that memory is freed. Sometimes that data contains passwords, cryptographic keys, confidential documents, etc., but it is impossible for the kernel to know, in the general case, which pages are sensitive. By clearing memory when it is deallocated, the lifetime of this potentially sensitive data can be reduced. A research paper describes some experiments that showed memory values persisting for days and even weeks on Linux systems. A bug in the kernel that leaked memory information could potentially leak these values to attackers.

So, Highsmith proposed adding a memory sanitization feature that has long been a part of the patches applied to the kernel by the PaX security project. There is clearly a performance impact to clearing memory as it is reclaimed, but, since memory is cleared as it is allocated (to avoid obvious information leaks), the impact may not be as large as it seems at first glance. As Arjan van de Ven points out:

.. and if we zero on free, we don't need to zero on allocate. While this is a little controversial, it does mean that at least part of the cost is just time-shifted, which means it'll not be TOO bad hopefully...
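The difference between the two policies can be seen in a toy model. The following Python sketch is not kernel code: PagePool and leaks_after_free() are invented for illustration, and a bytearray stands in for physical memory. It shows that, with zero-on-allocate only, a secret stays readable in raw memory after the page is freed, while zero-on-free removes it immediately and lets the allocation path skip its own clearing.

```python
class PagePool:
    """Toy page allocator over a bytearray (illustrative only)."""

    def __init__(self, nr_pages, page_size=16, sanitize_on_free=False):
        self.page_size = page_size
        self.sanitize_on_free = sanitize_on_free
        self.raw = bytearray(nr_pages * page_size)  # starts out zeroed
        self.free_pages = list(range(nr_pages))

    def _zero(self, page):
        start = page * self.page_size
        self.raw[start:start + self.page_size] = bytes(self.page_size)

    def alloc(self):
        page = self.free_pages.pop()
        if not self.sanitize_on_free:
            # Zero on allocate: stale data survives until reuse.
            self._zero(page)
        # With zero on free, every free page is already clean.
        return page

    def write(self, page, data):
        if len(data) > self.page_size:
            raise ValueError("data does not fit in one page")
        start = page * self.page_size
        self.raw[start:start + len(data)] = data

    def free(self, page):
        if self.sanitize_on_free:
            self._zero(page)
        self.free_pages.append(page)


def leaks_after_free(sanitize):
    """Return True if a secret is still in raw memory after being freed."""
    pool = PagePool(nr_pages=4, sanitize_on_free=sanitize)
    page = pool.alloc()
    pool.write(page, b"hunter2")
    pool.free(page)
    return b"hunter2" in pool.raw
```

Either way each page is cleared once per allocate/free cycle, which is the "time-shifted" cost in the quote above; the model says nothing about the cache effects, which a sketch like this cannot capture.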

Peter Zijlstra is concerned about the cache effects: "zero on allocate has the advantage of cache hotness, we're going to use the memory, why else allocate it. [...] zero on free only causes extra cache evictions for no gain." But van de Ven describes how he sees the caches being affected, concluding: "Don't get me wrong, I'm not arguing that zero-on-free is better, I'm just trying to point out that the 'advantage' of zero-on-allocate isn't nearly as big as people sometimes think it is..."

But some, like Alan Cox, think the performance impact is immaterial: "If you need this kind of data wiping then the performance hit is basically irrelevant, the security comes first." Zijlstra and others are concerned about the price that is paid by all kernel users, even those who have not enabled sanitize_mem. He notes that the patches would add extra function calls and branches even when the feature is not enabled. Suggestions were made to benchmark the proposed code against the existing implementation, but that is where the conversation started to go off the rails.

Highsmith obviously gets frustrated with the direction of the discussion but, rather than stepping back, he lashes out. There is certainly some provocation in the thread; Zijlstra's "Really, get a life, go fix real bugs. Don't make our kernel slower for wanking rights." comment didn't help. But Highsmith needs to recognize that he is the one trying to get something added to the kernel, so the burden of "proof" is on him. Instead, his condescending manner seems to indicate that he feels like he is presenting the kernel community with a gift, one they are too slow-witted to understand.

An important characteristic for kernel contributors is that they work well with the rest of the community: answer questions, respond to code review suggestions, etc. When that doesn't happen, patches tend to be ignored, regardless of their technical merit, and Highsmith seems headed down that path. When it was suggested that he use kzfree(), which clears memory and then frees it, on specific kernel allocations holding sensitive data, Highsmith responded:

That's hopeless, and kzfree is broken. Like I said in my earlier reply, please test that yourself to see the results. Whoever wrote that ignored how SLAB/SLUB work and if kzfree had been used somewhere in the kernel before, it should have been noticed [a] long time ago.

Since Highsmith was responding to SLAB maintainer Pekka Enberg's suggestion, that response—even if true—probably wasn't the right approach. Enberg and others asked specifically about the problems in kzfree(), but the response from Highsmith was a combination of condescension and vagueness. As soon as Enberg and Ingo Molnar tried to pin down where those problems are, Highsmith went off on a rant about the SLOB memory allocator.

In addition, Molnar has pointed out that some of the same sensitive values can have long lifetimes on the kernel stack:

Long-lived tasks that touched any crypto path (or other sensitive data in the kernel) and leaked it to the kernel stack can possibly keep sensitive information there indefinitely (especially if that information got there in an accidentally deep stack context) - up until the task exits. That information will outlive the freeing and sanitizing of the original sensitive data.

Rather than recognize this as an additional area that needs addressing, Highsmith just continues his tirade:

But you and the other cabal of vagueness have only sent mostly useless comments, outright uncivil responses, obvious misdirection attempts, unfounded critics, etc. I haven't seen more fallacies put together since the last time I read an unreleased film script by Jerry Lewis.

Overall, the idea of clearing memory as it is freed based on a boot time flag is reasonable. Several kernel hackers, including Cox and Rik van Riel, have expressed interest in seeing the feature added. With some effort, it would seem that the performance cost for the disabled case could be reduced to an acceptable level, but if the main proponent is spending his time fighting and flaming, it seems unlikely that it will ever get merged.

A newer set of patches, which just uses kzfree() in specific sensitive places (tty buffer management, 802.11 key handling, and the crypto API), was also proposed by Highsmith, but Linus Torvalds was not particularly impressed. In his view, there was no need to use kzfree() there; a simple memset() would be sufficient. Torvalds was convinced neither of the need for the patches nor by how Highsmith responded to review:

but quite frankly, I'm not convinced about these patches at all.

I'm also not in the least convinced about how you just dismiss everybodys concerns.

There were some additional technical complaints about the patches as well, particularly the use of kzfree() everywhere in the crypto API patch. Crypto API maintainer Herbert Xu noted: "The zeroing of metadata is gratuitous." Overall, the patches had the look of being created grudgingly, as if it were a favor to do so.

Where things go from here is unclear. Highsmith seemed to possibly be signing off in his reply to Torvalds: "The next time a kernel vulnerability appears that is remotely related to some of the venues of attack I've commented, it will be useful to be able to refer to these responses." There is some justification for Highsmith's frustration, but he needs to see that it isn't going to do him (or the kernel) any good.

Kernel contributors, especially new ones, need to recognize that the community has folks that are at least as smart as they are. In this case, some of those developers may not have the security focus that Highsmith does, but that doesn't reduce their understanding of the kernel, nor their interest in seeing it have patches applied for better security. It would be unfortunate to see this feature, which could be very useful in some environments, fall by the wayside.


Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds