Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.37-rc4, released on November 29. "As suspected, spending most of the week in Japan made some kernel developers break out in gleeful shouts of 'let's send Linus patches when he is jet-lagged, and see if we can confuse him even more than usual'. As a result -rc4 has about twice the commits that -rc3 had." It's still mostly fixes, though; see the announcement for the short-form changelog, or the full changelog for all the details.
Stable updates: there have been no stable updates released over the last week.
Quotes of the week
The Linux Foundation's kernel development report
The Linux Foundation has announced the annual update of its report on kernel development. There is little there that will be new to LWN readers, but, in the humble opinion of your editor (who is one of the authors), it is a good summary of the situation. "This paper documents a bit less frenzied development than the last one, which was expected given all the new features of 2.6.30 (ext4, ftrace, btrfs, perf etc) as well as the peak of merged drivers from the Linux staging tree. Regardless, this report continues to paint a picture of a very strong and vibrant development community."
Restricting /proc/kallsyms - again
During the 2.6.37 merge window, a change was merged which made /proc/kallsyms unreadable by unprivileged users by default. That change was subsequently reverted when it was found to break the boot process on an older Ubuntu release. A new form of the patch has returned which fixes that problem - but it still may not be merged.

The new patch is quite simple: if the process reading the file lacks the CAP_SYS_ADMIN capability, /proc/kallsyms appears to be an empty file. It has been confirmed that this version of the patch no longer breaks user space. But there were complaints anyway: rather than restricting access to the file with the usual access control bits, this patch encodes a policy (CAP_SYS_ADMIN) into the kernel, where it cannot be changed. That rubs a number of people the wrong way, so this patch probably will not go in either. Instead, concerned administrators (or distributors) will need to simply change the permissions on the file at boot time.
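The shape of the change is easy to sketch. What follows is a rough approximation of the approach, not the actual patch: the seq_file "show" routine in kernel/kallsyms.c simply produces no output when the reader lacks the capability, so the file reads as empty.

    /* Rough sketch of the approach, not the actual patch. */
    static int s_show(struct seq_file *m, void *p)
    {
        struct kallsym_iter *iter = m->private;

        if (!capable(CAP_SYS_ADMIN))
            return 0;    /* emit nothing; the file appears empty */

        /* ... otherwise format the symbol as usual ... */
        seq_printf(m, "%08lx %c %s\n", iter->value, iter->type, iter->name);
        return 0;
    }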
Structure holes and information leaks
Many of the kernel security vulnerabilities reported are information leaks - passing the contents of uninitialized memory back to user space. These leaks are not normally seen as severe problems, but the potential for trouble always exists. An attacker may be able to find a sequence of operations which puts useful information (a cryptographic key, perhaps) into a place where the kernel will leak it. So information leaks should be avoided, and they are routinely fixed when they are found.

Many information leaks are caused by uninitialized structure members. It can be easy to forget to assign to all members in all paths, or the form of the structure might change over time. One way to avoid that possibility is to use something like memset() to clear the entire structure at the outset. Kernel code uses memset() in many places, but there are places where that is seen as an expensive and unnecessary call; why clear a bunch of memory which will be assigned to anyway?
One way of combining the clearing and assignment operations is with a structure initialization like:
    struct foo {
        int bar, baz;
    } f = {
        .bar = 1,
    };
In this case, the baz field will be implicitly set to zero. This kind of declaration should ensure that there will be no information leaks involving this structure. Or maybe not. Consider this structure instead:
    struct holy_foo {
        short bar;
        long baz;
    };
On a 32-bit system, this structure likely contains a two-byte hole between the two members. It turns out that the C standard does not require the compiler to initialize holes; it also turns out that GCC duly leaves them uninitialized. So, unless one knows that a given structure cannot have any holes on any relevant architecture, structure initializations are not a reliable way of avoiding uninitialized data.
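The hole is easy to demonstrate from user space; this little program (an illustrative snippet, not something from the discussion) prints the layout:

    #include <stdio.h>
    #include <stddef.h>

    struct holy_foo { short bar; long baz; };

    int main(void)
    {
        /* On a typical 32-bit ABI, bar occupies bytes 0-1 and baz must
           be 4-byte aligned, so bytes 2-3 are padding which an
           initializer need not clear. */
        printf("sizeof: %zu offsetof(baz): %zu\n",
               sizeof(struct holy_foo),
               offsetof(struct holy_foo, baz));
        return 0;
    }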
There has been some talk of asking the GCC developers to change their behavior and initialize holes, but, as Andrew Morton pointed out, that would not help for at least the next five years, given that older compilers would still be in use. So it seems that there is no real alternative to memset() when initializing structures which will be passed to user space.
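In practice, that means code preparing such a structure for user space should look something like the following minimal sketch (the structure and values are purely illustrative):

    struct holy_foo f;

    memset(&f, 0, sizeof(f));    /* clears the padding bytes too */
    f.bar = 1;
    f.baz = 2;
    if (copy_to_user(buf, &f, sizeof(f)))
        return -EFAULT;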
Kernel development news
Conditional tracepoints
Tracepoints are small hooks placed into kernel code; when they are enabled, they can generate event information which can be consumed through the ftrace or perf interfaces. These tracepoints are defined via the decidedly gnarly TRACE_EVENT() macro which Steven Rostedt nicely described in detail for LWN earlier this year. As kernel developers add more tracepoints to the kernel, they are occasionally finding things which can be improved. One of those seems relatively simple: what if a tracepoint should only fire some of the time?

Arjan van de Ven recently posted a patch adding a tracepoint to __mark_inode_dirty(), a function called deep within the virtual filesystem layer to, surprisingly, mark an inode as being dirty. Arjan's purpose is to figure out which processes are causing files to have dirty contents; that will allow tools like PowerTop to tell laptop users which process is causing their disk to spin up. The only problem is that some calls to __mark_inode_dirty() are essentially noise from this point of view; they happen, for example, when an inode is first created or is being freed. Tracing those calls could create a stream of useless events which would have to be filtered out by PowerTop, causing PowerTop itself to require more power. So it is preferable to avoid creating those events in the first place if possible.
For that reason, Arjan made the call to the tracepoint conditional:
    if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
        trace_writeback_inode_dirty(inode, flags);
This code works in that it causes the tracepoint to be "hit" only when an application has actually done something to dirty an inode.
The VFS developers seem to have no objection to this tracepoint being added; the resulting information can be useful. But they didn't like the conditional nature of it. Part of the problem is that tracepoints are supposed to keep a low profile; developers want to be able to ignore them most of the time. Expanding a tracepoint to two lines and an if statement rather defeats that goal. But tracepoints are also supposed to not affect execution time. They have been carefully coded to impose almost no overhead when they are not enabled (which is most of the time); with techniques like jump label, that overhead can be reduced even further. But that if statement, being outside of the tracepoint altogether, will always be executed regardless of whether the tracepoint is currently enabled or not. Multiply that test-and-jump across millions of calls to __mark_inode_dirty() on each of millions of machines, and the extra CPU cycles start to add up.
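To see why, consider what a tracepoint invocation expands to; this is a simplified sketch of the mechanism (details vary by kernel version), not the real macro output:

    /*
     * Simplified sketch of what trace_writeback_inode_dirty() expands
     * to. The enabled test lives inside the tracepoint, where jump
     * labels can patch it away; Arjan's if statement sits outside and
     * runs unconditionally.
     */
    static inline void trace_writeback_inode_dirty(struct inode *inode,
                                                   int flags)
    {
        if (unlikely(__tracepoint_writeback_inode_dirty.state))
            __do_trace_writeback_inode_dirty(inode, flags);
    }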
So it was asked: could this test be moved into the tracepoint itself? One approach might be to put the test into the TP_fast_assign() portion of the tracepoint, which copies the tracepoint data into the tracing ring buffer. The problem with that idea is that, by that time, the tracepoint has already fired, space has been allocated in the ring buffer, etc. There is currently no mechanism to cancel a tracepoint hit partway through. There has, in the past, been talk of adding some sort of "never mind" operation which could be invoked within TP_fast_assign(), but that idea seems less than entirely elegant.
What might happen, instead, is the creation of a variant of TRACE_EVENT() with a name like TRACE_EVENT_CONDITION(). It would take an extra parameter which would be, of course, another tricky macro. For Arjan's tracepoint, the condition would look something like:
    TP_CONDITION(
        if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
            return 1;
        else
            return 0;
    ),
The tracepoint code would then test the condition before doing any other work associated with the tracepoint - but only if the tracepoint itself has been enabled.
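Putting the pieces together, a conditional form of Arjan's tracepoint might be defined like this; the sketch is speculative, since TRACE_EVENT_CONDITION() does not yet exist, and everything other than the TP_CONDITION() section simply follows the usual TRACE_EVENT() layout:

    TRACE_EVENT_CONDITION(writeback_inode_dirty,

        TP_PROTO(struct inode *inode, int flags),
        TP_ARGS(inode, flags),

        TP_CONDITION(
            if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
                return 1;
            else
                return 0;
        ),

        TP_STRUCT__entry(
            __field(unsigned long, ino)
            __field(int, flags)
        ),

        TP_fast_assign(
            __entry->ino = inode->i_ino;
            __entry->flags = flags;
        ),

        TP_printk("ino=%lu flags=%d", __entry->ino, __entry->flags)
    );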
This solution should help to keep the impact of tracepoints to a minimum once again, especially when those tracepoints are not enabled. There is one potential problem in that the condition is now hidden deeply within the definition of the tracepoint; that definition is usually found in a special header file far from the code where the tracepoint is actually inserted. At the tracepoint itself, the condition which might cause it not to fire is not visible in any way. So, if somebody other than the initial developer wants to use the tracepoint, they could misinterpret a lack of output as a sign that the surrounding code is not being executed at all. That little problem could presumably be worked around with clever tracepoint naming, better documentation, or simply expecting users to understand what tracepoints are actually telling them.
The best way to throw blocks away
An old-style rotating disk drive does not really care if any specific block contains useful data or not. Every block sits in its assigned spot (in a logical sense, at least), and the knowledge that the operating system does not care about the contents of any particular block is not something the drive can act upon in any way. More recent storage devices are different, though; they can - in theory, at least - optimize their behavior if they know which blocks are actually worth hanging onto. Linux has a couple of mechanisms for communicating that knowledge to the block layer - one added for 2.6.37 - but it's still not clear which of those is best.

So when might a block device want to know about blocks that the host system no longer cares about? The answer is: just about any time that there is a significant mapping layer between the host's view of the device and the true underlying medium. One example is solid-state storage devices (SSDs). These devices must carefully shuffle data around to spread erase cycles across the media; otherwise, the device will almost certainly fail prematurely. If an SSD knows which blocks the system actually cares about, it can avoid copying the others and make the best use of each erase cycle.
A related technology is "thin provisioning," where a storage array claims to be much larger than it really is. When the installed storage fills, the device can gently suggest that the operator install more drives, conveniently available from the array's vendor. In the absence of knowledge about disused blocks, the array must assume that every block that has ever been written to contains useful data. That approach may sell more drives in the short term, but vendors who want their customers to be happy in the long term might want to be a bit smarter about space management.
Regardless of the type of any specific device, it cannot know about uninteresting blocks unless the operating system tells it. The ATA and SCSI standards committees have duly specified operations for communicating this information; those operations are often called "trim" or "discard" at the operating system level. Linux has had support for trim operations in the block layer for some time; a few filesystems (and the swap code) have also been modified to send down trim commands when space is freed up. So Linux should be in good shape when it comes to trim support.
The only problem is that on-the-fly trim (also called "online discard") doesn't work that well. On some devices, it slows operation considerably; there have also been claims that excessive trimming can itself shorten drive life. The fact that the SATA version of trim is a non-queued operation (so all other I/O must be stopped before a trim can be sent to the drive) is also extremely unhelpful. The observed problems have been so widespread that SCSI maintainer James Bottomley was recently heard to say:
The alternative is "batch discard," where a trim operation is used to mark large chunks of the device unused in a single operation. Batch discard operations could be run from the filesystem code; they could also run periodically from user space. Using batch discard to run trim on every free space extent would be a logical thing to do after an fsck run as well. Batching discard operations implies that the drive does not know immediately when space becomes unused, but it should be a more performance- and drive-friendly way to do things.
The 2.6.37 kernel includes a new ioctl() command called FITRIM which is intended for batch discard operations. The parameter to FITRIM is a structure describing the range to be trimmed:
    struct fstrim_range {
        uint64_t start;
        uint64_t len;
        uint64_t minlen;
    };
An ioctl(FITRIM) call will instruct the filesystem to find the free space between start and start+len-1 (in bytes) and issue trim commands for it. Free extents smaller than minlen bytes will be skipped in this process. The operation can be run over the entire device by setting start to zero and len to ULLONG_MAX. It's worth repeating that this command is implemented by the filesystem, so only the space known by the filesystem to be free will actually be trimmed. In 2.6.37, it appears that only ext4 will have FITRIM support, but other filesystems will certainly get that support in time.
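As a concrete example, a small user-space tool to trim all of the free space on a mounted filesystem might look like this minimal sketch (the /mnt path is arbitrary, error handling is abbreviated, and the call requires the CAP_SYS_ADMIN capability):

    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>    /* FITRIM, struct fstrim_range */

    int main(void)
    {
        struct fstrim_range range;
        int fd = open("/mnt", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(&range, 0, sizeof(range));
        range.len = ULLONG_MAX;    /* cover the whole filesystem */
        range.minlen = 0;          /* do not skip any free extents */
        if (ioctl(fd, FITRIM, &range) < 0)
            perror("FITRIM");
        close(fd);
        return 0;
    }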
Batch discard using FITRIM should address the problems seen with online discard - it can be applied to large chunks of space, at a time which is convenient for users of the system. So it may be tempting to just give up on online discard. But Chris Mason cautions against doing that:
So the kernel developers will probably not trim online discard support at this time. No filesystem enables it by default, though, and that seems unlikely to change. But if, at some future time, implementations of the trim operation improve, Linux should be ready to use them.
The kernel and the C library as a single project
The kernel has historically been developed independently of anything that runs in user space. The well-defined kernel ABI, built around the POSIX standard, has allowed for a nearly absolute separation between the kernel and the rest of the system. Linux is nearly unique, however, in its division of kernel and user-space development. Proprietary operating systems have always been managed as a single project encompassing both user and kernel space; other free systems (the BSDs, for example) are run that way as well. Might Linux ever take a more integrated approach?

Christopher Yeoh's cross-memory attach patch was covered here last September. He recently sent out a new version of the patch, wondering, in the process, how he could get a response other than silence. Andrew Morton answered that new system calls are increasingly hard to get into the mainline:
Ingo Molnar jumped in with a claim that the C library (libc) is the real problem. Getting a new feature into the kernel and, eventually, out to users takes long enough. But getting support for new system calls into the C library seems to take much longer. In the meantime, those system calls languish, unused. It is possible for a suitably motivated developer to invoke an unsupported system call with syscall(), but that approach is fiddly, Linux-specific, and not portable across architectures (since system call numbers can change from one architecture to the next). So most real-world use of syscall() is probably due to kernel developers testing out new system calls.
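The long-unwrapped gettid() call shows what that looks like in practice; the only way to reach it from C is the raw syscall() interface:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        /* glibc provides no gettid() wrapper, so the call must go
           through syscall() with a number (SYS_gettid) that differs
           from one architecture to the next. */
        long tid = syscall(SYS_gettid);

        printf("thread id: %ld\n", tid);
        return 0;
    }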
But, Ingo said, it doesn't have to be that way:
Ingo went on to describe some of the benefits that could come from a built-in libc. At the top of the list is the ability to make standard libc functions take advantage of new system calls as soon as they are available; applications would then get immediate access to the new calls. Instrumentation could be added, eventually integrating libc and kernel tracing. Perhaps something better could have been done with asynchronous I/O. And so on. He concluded by saying "Apple and Google/Android understands that single-project mentality helps big time. We don't yet."
As of this writing, nobody has responded to this suggestion. Perhaps it seems too fantastical, or, perhaps, nobody is reading the cross-memory attach thread. But it is an interesting idea to ponder.
In the early days of Linux kernel development, the purpose was to create an implementation of a well-established standard for which a great deal of software had already been written. There was room for discussion about how a specific system call might be implemented between the C library and the kernel, but the basic nature of the task was well understood. At this point, Linux has left POSIX far behind; that standard is fully implemented and any new functionality goes beyond it. New system calls are necessarily outside of POSIX, so taking advantage of them will require user-space changes that, say, a better open() implementation would not. But new features are only really visible if and when libc responds by making use of them and by making them available to applications. The library most of us use (glibc) has not always been known for its quick action in that regard.
Turning libc into an extension of the kernel itself would short out the current library middlemen. Kernel developers could connect up and make use of new system calls immediately; they would be available to applications at the same time that the kernel itself is. The two components would presumably, over time, work together better. A kernel-tied libc could also shed a lot of compatibility code which is required if it must work properly with a wide range of kernel versions. If all went well, we could have a more tightly integrated libc which offers more functionality and better performance.
Such a move would also raise some interesting questions, naturally, starting with "which libc?" The obvious candidate would be glibc, but it's a large body of code which is not universally loved. The developers of whichever version of libc is chosen might want to have a say in the matter; they might not immediately welcome their new kernel overlords. One would hope that the ability to run the system with an alternative C library would not be compromised. Picking up the pace of libc development might bring interesting new capabilities, but there is also the ever-present possibility of introducing new regressions. Licensing could raise some issues of its own; an integrated libc would have to remain separate enough to carry a different license.
And, one should ask, where would the process stop? Putting nethack into the kernel repository might just pass muster, but, one assumes, Emacs would encounter resistance and LibreOffice is probably out of the question.
So a line needs to be drawn somewhere. This idea has come up in the past, and the result has been that the line has stayed where it always was: at the kernel/user-space boundary. Putting perf into the kernel repository has distorted that line somewhat, though. By most accounts, the perf experiment has been a success; perf has evolved from a rough utility to a powerful tool in a surprisingly short time. Perhaps an integrated C library would be an equally interesting experiment. Running that experiment would take a lot of work, though; until somebody shows up with a desire to do that work, it will continue to be no more than a thought experiment.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet