User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current 2.6 development kernel remains 2.6.30-rc3; no new prepatches have been released over the last week. Changes continue to flow into the mainline repository; they are almost all fixes, but there was also the restoration of Tux as the kernel mascot.

The current stable 2.6 kernel is, released on April 27 with about 100 patches. "There are a lot of fixes in this release touching all over the tree. At least a few have possible security impact (e.g. af_rose, agp, capability fs_mask, splice/ocfs2). As usual, you're encouraged to upgrade."

Comments (none posted)

Kernel development news

Quotes of the week

Hey, it was only meant to be a single release. Now they can all die as far as I'm concerned.
-- Linus Torvalds returns Tux to the kernel

In days of old in 2.6.29, netfilter did locketh using a lock of the reader kind when doing its table business, and do a writer when with pen in hand like a overworked accountant did replace the tables. This sucketh and caused the single lock to fly back and forth like a poor errant boy.

But then netfilter was blessed with RCU and the performance was divine, but alas there were those that suffered for trying to replace their many rules one at a time.

So now RCU must be vanquished from the scene, and better chastity belts be placed upon this valuable asset most dear. The locks that were but one are now replaced by one per suitor.

The repair was made after much discussion involving Eric the wise, and Linus the foul. With flowers springing up amid the thorns some peace has finally prevailed and all is soothed.

-- Stephen Hemminger

First of all, I happily admit that wrt locking I'm a barbarian, and proud of it. I.e. simpler locking scheme beats theoretical improvement, unless we have really good evidence that there's a real-world problem. All things equal, complexity loses. All things not quite equal - ditto.
-- Al Viro

Comments (3 posted)

USB and fast booting

By Jake Edge
April 29, 2009

The changes that are being made for a faster-booting Linux have generally been welcomed, but when they lead to an apparent regression, complaints will be heard. That situation arose recently when Jeff Garzik reported a regression that caused one of his systems to no longer boot. Because of some changes made to the way USB initializes, the system no longer recognized his disks in time to mount the root filesystem from them. As it turns out, the problem is not limited to disks, nor is it new; it is a longstanding race condition that previously was being "won" by most hardware, but that same hardware is often losing the race now.

Garzik had bisected the problem to a particular commit made back in September of 2008. Instead of sleeping for 100ms as part of the initialization of each USB root hub, the new code uses the delayed work mechanism to schedule the next initialization step 100ms in the future. For kernels which had the USB code built-in, this would allow the boot thread to do other work, rather than block waiting for these delays. It had a rather positive impact on boot speed, with patch author Alan Stern reporting:

Arjan van de Ven was very pleased to see that this shaved 700 ms off his computer's boot time. Since his total boot time is on the order of two seconds, the improvement is considerable.

From Garzik's perspective, the problem is that this system booted successfully with every kernel version until 2.6.28. The immediate suggestion was to use the rootdelay kernel boot option which will delay the boot process for the given number of seconds before trying to mount the root filesystem. That did not sit very well with Garzik, and he asked: "When did regressions become an acceptable tradeoff for speed?"

As it turns out, Garzik had just been "lucky" before, he could have run into this problem on earlier kernels with different hardware as Greg Kroah-Hartman points out: "What happens when you buy a new box with more USB host controllers and a faster processor? Same problem." The underlying issue is specific to USB, as the old initialization waited 100ms per USB bus (i.e. root hub) synchronously, so a system with five hubs would effectively wait 500ms for the first to initialize and enumerate the devices attached. The new code does those same initializations in parallel.

While it is relatively uncommon to have USB root filesystems, it is far from unheard of. Embedded systems are a fairly likely candidate, due to cost and form factor issues, as Alan Cox explained. Multiple distributions also have support for running completely from a USB device, typically a USB flash drive.

But, as Garzik and others point out, users that upgrade their kernels (or distributions who do so), but don't add in a rootdelay option, risk having systems that cannot boot. USB is fundamentally different than other buses, however, because there is no way to know when the enumeration of devices on a particular hub has been completed. Mark Lord questioned the explanation, noting: "SATA drives also take variable amounts of time to 'show up' at boot." But as Arjan van de Ven explained, there is a significant difference:

the difference is that with sata you know when you are done and have all possible drives. No so much much with USB. So with SATA we can, and do, wait for the scan to complete at the right point in the boot.

It turns out that the same problem in a slightly different guise shows up for embedded devices that use USB consoles. David VomLehn has been working on a patch to wait for USB consoles to become available. Because embedded devices often have USB consoles, but only for development and debugging, a long delay waiting for a console that is unlikely to show up in the majority of cases is undesirable. But, because it is impossible to know that all USB devices have reported in, some kind of delay is inevitable. VomLehn's mechanism would delay up until a timeout specified in the kernel boot parameters, but, unlike rootdelay, would wake up early as soon as a console device was detected.

As VomLehn notes, the problem goes even further than that, affecting USB network devices needed at boot time as well. Discussion on various versions of his patch also pointed out that similar problems exist for other buses. As boot parallelization gets better—and more pervasive—more of these kinds of problems are going to be discovered. A more general solution for devices required at boot time needs to be found as van de Ven describes:

For other pieces it's hard. Non-enumeratable busses just suck; at some point all you can do is just wait (which we have already available today for anyone to do). I realize people don't want to just wait 4 seconds (the people who first objected to boot time improvements then suddenly care about boot time ;-)...

For root fs there's some options, and I have patches to basically retry on fail. (The patches have a bug and I don't have time to solve it this week, so I'm not submitting them) For other devices it is hard. Realistically we need hotplug to work well enough so that when a device shows up, we can just hook it up when it does.

So far, the problems have just been identified and discussed. Workarounds like rootdelay have been mentioned, but that only "solves" one facet of the problem. Distributions are, or will be, shipping 2.6.29 kernels in their upcoming releases, one hopes they have already dealt with the issue or there may be a number of rather puzzled users with systems that don't boot. It would seem important to address the problems, at least for USB storage, as part of 2.6.31.

Comments (14 posted)

On the value of static tracepoints

By Jonathan Corbet
April 28, 2009
As has been well publicized by now, the Linux kernel lacks the sort of tracing features which can be found in certain other Unix-like kernels. That gap is not the result of a want of trying. In the past, developers trying to put tracing infrastructure into the kernel have often run into a number of challenges, including opposition from their colleagues who do not see the value of that infrastructure and resent its perceived overhead. More recently, it would seem that the increased interest in tracing has helped developers to overcome some of those objections; an ongoing discussion shows, though, that concerns about tracing are still alive and have the potential to derail the addition of tracing facilities to the kernel.

Sun's DTrace is famously a dynamic tracing facility, meaning that it can be used to insert tracepoints at (almost) any location in the kernel. But the Solaris kernel also comes with an extensive and well-documented set of static tracepoints which can be activated by name. These tracepoints have been placed at carefully-considered locations which facilitate investigations into what the kernel is actually doing. Many real-world DTrace scripts need only the static tracepoints and do no dynamic tracepoint insertion at all.

There is clear value in these static tracepoints. They represent the wisdom of the developers who (presumably) are the most familiar with each kernel subsystem. System administrators can use them to extract a great deal of useful information without having to know the code in question. Properly-placed static tracepoints bring a significant amount of transparency to the kernel. As tracing capabilities in Linux improve, developers naturally want to provide a similar set of static tracepoints. The fact that static tracing is reasonably well supported (via FTrace) in mainline kernels - with more extensive support available via SystemTap and LTTng - also encourages the creation of static tracepoints. As a result, there have been recent patches adding tracepoints to workqueues and some core memory management functions, among others.

Digression: static tracepoints

As an aside, it's worth looking at the form these tracepoints take; the design of Linux tracepoints gives a perspective on the problems they were intended to solve. As an example, consider the following tracepoints for the memory management code which reports on page allocations. The declaration of the tracepoint looks like this:

    #include <linux/tracepoint.h>

	TP_PROTO(unsigned long pfn, unsigned long free),

	TP_ARGS(pfn, free),

		__field(unsigned long, pfn)
		__field(unsigned long, free)

		__entry->pfn = pfn;
		__entry->free = free;

	TP_printk("pfn=%lx zone_free=%ld", __entry->pfn, __entry->free)

That seems like a lot of boilerplate for what is, in a sense, a switchable printk() call. But, naturally, there is a reason for each piece. The TRACE_EVENT() macro declares a tracepoint - this one is called mm_page_allocation - but does not yet instantiate it in the code. The tracepoint has arguments which are passed to at its actual instantiation (which we'll get to below); they are declared fully in the TP_PROTO() macro and named in the TP_ARGS() macro. Essentially, TP_PROTO() provides a function prototype for the tracepoint, while TP_ARGS() looks like a call to that tracepoint.

These values are enough to let the programmer place a tracepoint in the code with a line like:

			     zone_page_state(zone, NR_FREE_PAGES));

This tracepoint is really just a known point in the code which can have, at run time, one or more function pointers stored into it by in-kernel tracing utilities like SystemTap or Ftrace. When the tracepoint is enabled, any functions stored there will be called with the given arguments. In this case, enabling the tracepoint will result in calls whenever a page is allocated; those calls will receive the page frame number of the allocated page and the number of free pages remaining as parameters.

As can be seen in the declaration above, there's more to the tracepoint than those arguments; the rest of the information in the tracepoint declaration is used by the Ftrace subsystem. Ftrace has a couple of seemingly conflicting goals; it wants to be able to quickly enable human-readable output from a tracepoint with no external tools, but the Ftrace developers also want to be able to export trace data from the kernel quickly, without the overhead of encoding it first. And that's where the remaining arguments to TRACE_EVENT() come in.

When properly defined (the magic exists in a bunch of header files under kernel/trace), TP_STRUCT__entry() adds extra fields to the structure which represent the tracepoint; those fields should be capable of holding the binary parameters associated with the tracepoint. The TP_fast_assign() macro provides the code needed to copy the relevant data into that structure. That data can, with some changes merged for 2.6.30, be exported directly to user space in binary format. But, if the user just wants to see formatted information, the TP_printk() macro gives the format string and arguments needed to make that happen.

The end result is that defining a tracepoint takes a small amount of work, but using it thereafter is relatively easy. With Ftrace, it's a simple matter of accessing a couple of debugfs files. But other tools, including LTTng and SystemTap, are also able to make use of these tracepoints.

The disagreement

Given all the talk about tracing in recent years, there is clearly demand for this sort of facility in the kernel. So one might think that adding tracepoints would be uncontroversial. But, naturally enough, it's not that simple.

The first objection that usually arises has to do with the performance impact of tracepoints, which are often placed in the most performance-critical code paths in the kernel. That is, after all, where the real action happens. So adding an unconditional function call to implement a tracepoint is out of the question; even putting an if test around it is problematic. After literally years of work, the developers came up with a scheme involving run-time code patching that reduces the performance cost of an inactive tracepoint to, for all practical purposes, zero. Even the most performance-conscious developers have stopped fretting about this particular issue. But, of course, there are others.

A tracepoint exists to make specific kernel information available to user space. So, in some real sense, it becomes part of the kernel ABI. As an ABI feature, a tracepoint becomes set in stone once it's shipped in a stable kernel. There is not a universal agreement on the immutability of kernel tracepoints, but the simple fact is that, once these tracepoints become established and prove their usefulness, changing them will cause user-space tracing tools to break. That means that, even if tracepoints are not seen as a stable ABI the way system calls are, there will still be considerable resistance to changing them.

Keeping tracepoints stable when the code around them changes will be a challenge. A substantial subset of the developer community will probably never use those tracepoints, so they will tend to be unaware of them and will not notice when they break. But even a developer who is trying to keep tracepoints stable is going to run into trouble when the code evolves to the point that the original tracepoint no longer makes sense. One can imagine all kinds of cruft being added so that a set of tracepoints gives the illusion of a very different set of decisions than is being made in a future kernel; one can also imagine the hostile reception any such code will find.

The maintenance burden associated with tracepoints is the reason behind Andrew Morton's opposition to their addition. With regard to the workqueue tracepoints, Andrew said:

If someone wants to get down and optimise our use of workqueues then good for them, but that exercise doesn't require the permanent addition of large amounts of code to the kernel. The same amount of additional code and additional churn could be added to probably tens of core kernel subsystems, but what _point_ is there to all this? Who is using it, what problems are they solving?

We keep on adding all these fancy debug gizmos to the core kernel which look like they will be used by one person, once. If that!

Needless to say, the tracing developers see the code as being more widely useful than that. Frederic Weisbecker gave a detailed description of the sort of debugging which can be done with the workqueue tracepoints. Ingo Molnar's response appears to be an attempt to hold up the addition of other types of kernel instrumentation until the tracepoint issue is resolved. Andrew remains unconvinced, though; it seems he would rather see much of this work done with dynamic tracing tools instead.

As of this writing, that's where things stand. If these tracepoints do not get into the mainline, it is hard to see developers going out and creating others in the future. So Linux could end up without a set of well-defined static tracepoints for a long time yet - though it would not be surprising to see the enterprise Linux vendors adding some to their own kernels. Perhaps that is the outcome that the development community as a whole wants, but it's not clear that this feeling is universal at this time. If, instead, Linux is going to end up with a reasonable set of tracepoints, the development community will need to come to some sort of consensus on which kinds of tracing instrumentation is acceptable.

Comments (25 posted)

KSM tries again

By Jonathan Corbet
April 28, 2009
Back in November, LWN examined the KSM (kernel shared memory) patch. KSM was written to address a problem encountered on systems running virtualized guests: such systems can have large numbers of pages holding identical data, but the kernel has no way to let guests share those pages. The KSM code scans through memory to find pairs of pages holding identical data; when such pairs are found, they are merged into a single page mapped into both locations. The pages are marked copy-on-write, so the kernel will automatically separate them again should one process modify its data.

There were some concerns about the intended purpose of this patch, but it was soon made clear that KSM can help the system to save significant amounts of memory. But KSM was not merged as a result of two other problems. One of them, discussed mostly behind closed doors, appears to be concerns about the use of SHA1 hashes to compare pages. If an attacker could create hash collisions, he might just be able to inject his own data (or code) into processes belonging to other users and/or virtual machines. The other problem had to do with a different kind of attacker: VMWare holds a patent to an algorithm which looks quite similar to the method used by the early KSM patches. There is evidence that this patent could be overturned on prior art, but that is still a battle that nobody wants to fight.

KSM disappeared from view for a while after those issues came to light, but, more recently, new versions of the KSM patches have been posted for review. A quick look at the code makes it clear that both of these concerns have been addressed - and, in fact, that the KSM developers were able to kill both birds with the same stone. It's all a matter of doing away with the hashing of pages.

Patent 6,789,156 is not exactly light reading; it has a full 74 claims. Most of the independent claims have one thing in common, though: they include the calculation of a hash value to find identical pages in the system. If the KSM code were to avoid hashing pages, those claims of the patent would clearly not read against it. And, as described above, the use of hashing also created some security worries. So it must have seemed clear to the KSM developers (and any lawyers they may have talked to) that the hash needed to go.

The current KSM patches have replaced the hash table with two separate red-black trees. Pages tracked by KSM are initially stored in the "unstable tree"; the term "unstable" means that KSM considers their contents to be volatile. Placement in the tree is determined by a simple memcmp() of the page's contents; essentially, the page is treated as containing a huge number and sorted accordingly. The unstable tree is suitable for finding pages with duplicate contents; a relatively quick traversal of the tree will turn up the only candidates.

It's worth noting that KSM does not place every page it scans in the unstable tree. If the contents of a page change over the course of one memory scanning cycle, the page will not really be a good candidate for sharing anyway. So pages which are seen to change are not represented in the unstable tree. The unstable tree is also dumped and rebuilt from the beginning after each scan cycle. That deals with the problem of pages which, as a result of modifications, find themselves in the wrong location in the tree. The nature of red-black trees means that search and insertion operations are almost the same thing, so there is little real cost to rebuilding the unstable tree from the beginning every time.

The other pages which are not found in the unstable tree are those which have actually been merged with duplicates. Since shared pages are marked read-only, KSM knows that their contents cannot change. Those pages are put into a separate "stable tree." The stable tree is also a red-black tree, but, since pages cannot become misplaced there, it need not be rebuilt regularly. Once a page goes into the stable tree, it stays there until all users have either modified or unmapped it.

The resulting system clearly works. Dropping the hash may impose a cost in the form of slightly higher CPU and memory use; there have been no benchmarks posted which would show the difference. But there is no cost on systems which do not use KSM at all, and, in any case, avoiding the potential problems associated with using hash tables to identify pages with identical contents will be worth the cost. At this point, comments on the KSM code are mostly concerned with relatively small details. It could well be that this code will be ready for inclusion into the 2.6.31 kernel.

(Postscript: above, your editor noted that "most" of the independent claims in the VMWare patent required the use of a hashing mechanism. There are, in fact, a few claims which lack that requirement, but they replace it with one or two others. Some claims cover the use of copy-on-write pages, but they all explicitly say that this technique is used on pages with a "relatively high probability of impending modification." But there is little point in sharing such pages at all; KSM, by leaving them out of the unstable tree, ignores those pages entirely. The remaining claims describe partitioning memory into "classes," which is not done in KSM.)

Comments (21 posted)

Patches and updates

Kernel trees


Build system

Core kernel code

Development tools

Device drivers


Filesystems and block I/O


Virtualization and containers

  • Gregory Haskins: irqfd . (April 24, 2009)

Benchmarks and bugs


Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds