
Kernel development

Brief items

Kernel release status

The current development kernel is 3.6-rc1, announced on August 2. "As usual, even the shortlog is too big to usefully post, but there's the usual breakdown: about two thirds of the changes are drivers (with the CSR driver from the staging tree being a big chunk of the noise - christ, that thing is big and wordy even after some of the crapectomy). [...] Of the non-driver portion, a bit over a third is arch (arm, x86, tile, mips, powerpc, m68k), and the rest is a fairly even split among fs, include file noise, networking, and just 'rest'." See the summary below for what was merged after last week's update.

Stable updates: The 3.2.25 and 3.2.26 kernels were released on August 3 and August 5 respectively. The 3.2.27, 3.4.8, 3.0.40, and 3.5.1 stable reviews are underway as of this writing; those kernels can be expected on or after August 9.


Quotes of the week

Trust me: every problem in computer science may be solved by an indirection, but those indirections are *expensive*. Pointer chasing is just about the most expensive thing you can do on modern CPU's.
Linus Torvalds

When the GNU OS concept started the idea that everyone would have a Unix capable system on their desk was pretty hard to imagine. The choice of a Mach based microkernel was both in keeping with a lot of the research of the time and also had a social element. The vision was a machine where any user could for example implement their own personal file system without interfering with other users. Viewed in the modern PC world that sounds loopy but on a shared multi-user computer it was an important aspect of software freedom.

Sticking to Mach and being hostile to Linux wasn't very smart and a lot of developers have not forgiven the FSF for that, which is one reason they find the "GNU/Linux" label deeply insulting.

The other screw up was that they turned down the use of UZI, which would have given them a working if basic v7 Unix equivalent OS years before Linux was released. Had they done that Linux would never have happened and probably the great Windows battle would have been much more fascinating.

— History lessons from Alan Cox


The conclusion of the 3.6 merge window

By Jonathan Corbet
August 3, 2012
Linus closed the 3.6 merge window on August 2, a couple of days earlier than would have normally been expected. There were evidently two reasons for that: a desire to send a message to those who turn in their pull requests on the last day of the merge window, and his upcoming vacation. In the end, he only pulled a little over 300 changes since the previous merge window summary, with the result that 8,587 changes were pulled in the 3.6 merge window as a whole.

Those 300+ changes included the following:

  • The block I/O bandwidth controller has been reworked so that each control group has its own request list, rather than working from a single, global list. This increases the memory footprint of block I/O control groups, but makes them function in a manner much closer to the original intention when lots of requests are in flight.

  • A set of restrictions on the creation of hard and soft links has been added in an attempt to improve security; they should eliminate a lot of temporary file vulnerabilities.

  • The device mapper dm-raid module now supports RAID10 (a combination of striping and mirroring).

  • The list of new hardware support in 3.6 now includes OMAP DMA engines.

  • The filesystem freeze functionality has been reimplemented to be more robust; in-tree filesystems have been updated to use the new mechanism.

The process of stabilizing all of those changes now begins; if the usual patterns hold, the final 3.6 kernel can be expected sometime in the second half of September.


Kernel development news

Testing for kernel performance regressions

By Jonathan Corbet
August 3, 2012
It is not uncommon for software projects — free or otherwise — to include a set of tests intended to detect regressions before they create problems for users. The kernel lacks such a set of tests. There are some good reasons for this; most kernel problems tend to be associated with a specific device or controller and nobody has anything close to a complete set of relevant hardware. So the kernel depends heavily on early testers to find problems. The development process is also, in the form of the stable trees, designed to collect fixes for problems found after a release and to get them to users quickly.

Still, there are places where more formalized regression testing could be helpful. Your editor has, over the years, heard a large number of presentations given by large "enterprise" users of Linux. Many of them expressed the same complaint: they upgrade to a new kernel (often skipping several intermediate versions) and find that the performance of their workloads drops considerably. Somewhere over the course of a year or so of kernel development, something got slower and nobody noticed. Finding performance regressions can be hard; they often only show up in workloads that do not exist except behind several layers of obsessive corporate firewalls. But the fact that there is relatively little testing for such regressions going on cannot help.

Recently, Mel Gorman ran an extensive set of benchmarks on a set of machines and posted the results. He found some interesting things that tell us about the types of performance problems that future kernel users may encounter.

His results include a set of scheduler tests, consisting of the "starve," "hackbench," "pipetest," and "lmbench" benchmarks. On an Intel Core i7-based system, the results were generally quite good; he noted a regression in 3.0 that was subsequently fixed, and a regression in 3.4 that still exists, but, for the most part, the kernel has held up well (and even improved) for this particular set of benchmarks. At least, until one looks at the results for other processors. On a Pentium 4 system, various regressions came in late in the 2.6.x days, and things got a bit worse again through 3.3. On an AMD Phenom II system, numerous regressions have shown up in various 3.x kernels, with the result that performance as a whole is worse than it was back in 2.6.32.

Mel has a hypothesis for why things may be happening this way: core kernel developers tend to have access to the newest, fanciest processors and are using those systems for their testing. So the code naturally ends up being optimized for those processors, at the expense of the older systems. Arguably that is exactly what should be happening; kernel developers are working on code to run on tomorrow's systems, so that's where their focus should be. But users may not get flashy new hardware quite so quickly; they would undoubtedly appreciate it if their existing systems did not get slower with newer kernels.

He ran the sysbench tool on three different filesystems: ext3, ext4, and xfs. All of them showed some regressions over time, with the 3.1 and 3.2 kernels showing especially bad swapping performance. Thereafter, things started to improve, with the developers' focus on fixing writeback problems almost certainly being a part of that solution. But ext3 is still showing a lot of regressions, while ext4 and xfs have gotten a lot better. The ext3 filesystem is supposed to be in maintenance mode, so it's not surprising that it isn't advancing much. But there are a lot of deployed ext3 systems out there; until their owners feel confident in switching to ext4, it would be good if ext3 performance did not get worse over time.

Another test is designed to determine how well the kernel does at satisfying high-order allocation requests (being requests for multiple, physically-contiguous pages). The result here is that the kernel did OK and was steadily getting better—until the 3.4 release. Mel says:

This correlates with the removal of lumpy reclaim which compaction indirectly depended upon. This strongly indicates that enough memory is not being reclaimed for compaction to make forward progress or compaction is being disabled routinely due to failed attempts at compaction.

On the other hand, the test does well on idle systems, so the anti-fragmentation logic seems to be working as intended.

Quite a few other test results have been posted as well; many of them show regressions creeping into the kernel in the last two years or so of development. In a sense, that is a discouraging result; nobody wants to see the performance of the system getting worse over time. On the other hand, identifying a problem is the first step toward fixing it; with specific metrics showing the regressions and when they first showed up, developers should be able to jump in and start fixing things. Then, perhaps, by the time those large users move to newer kernels, these particular problems will have been dealt with.

That is an optimistic view, though, that is somewhat belied by the minimal response to most of Mel's results on the mailing lists. One gets the sense that most developers are not paying a lot of attention to these results, but perhaps that is a wrong impression. Possibly developers are far too busy tracking down the causes of the regressions to be chattering on the mailing lists. If so, the results should become apparent in future kernels.

Developers can also run these tests themselves; Mel has released the whole set under the name MMTests. If this test suite continues to advance, and if developers actually use it, the kernel should, with any luck at all, see fewer core performance regressions in the future. That should make users of all systems, large or small, happier.


A generic hash table

By Jake Edge
August 8, 2012

A data structure implementation that is more or less replicated in 50 or more places in the kernel seems like some nice low-hanging fruit to pick. That is just what Sasha Levin is trying to do with his generic hash table patch set. It implements a simple fixed-size hash table and starts the process of changing various existing hash table implementations to use this new infrastructure.

The interface to Levin's hash table is fairly straightforward. The API is defined in linux/hashtable.h and one declares a hash table as follows:

    DEFINE_HASHTABLE(name, bits)

This creates a table with the given name and a power-of-2 size based on bits. The table is implemented using buckets containing a kernel struct hlist_head type. It implements a chaining hash, where hash collisions are simply added to the head of the hlist. One then calls:

    hash_init(name, bits);

to initialize the buckets.

Once that's done, a structure containing a struct hlist_node can be constructed to hold the data to be inserted, which is done with:

    hash_add(name, bits, node, key);

where node is a pointer to the hlist_node and key is the key that is hashed into the table. There are also two mechanisms to iterate over the table. The first iterates through the entire hash table, returning the entries in each bucket:

    hash_for_each(name, bits, bkt, node, obj, member)

The second returns only the entries that correspond to the key's hash bucket:

    hash_for_each_possible(name, obj, bits, node, member, key)

In each case, obj is the type of the underlying data, node is a struct hlist_node pointer to use as a loop cursor, and member is the name of the struct hlist_node member in the stored data type. In addition, hash_for_each() needs an integer loop cursor, bkt. Beyond that, one can remove an entry from the table with:

    hash_del(node);

Levin has also converted six different hash table uses in the kernel as examples in the patch set. While the code savings aren't huge (a net loss of 16 lines), they could be reasonably significant after converting the 50+ different fixed-size hash tables that Levin found in the kernel. There is also the obvious advantage of restricting all of the hash table implementation bugs to one place.

There has been a fair amount of discussion of the patches over the three revisions that Levin has posted so far. Much of it concerned implementation details, but there was another more global concern as well. Eric W. Biederman was not convinced that replacing the existing simple hash tables was desirable:

For a trivial hash table I don't know if the abstraction is worth it. For a hash table that starts off small and grows as big as you need it the [incentive] to use a hash table abstraction seems a lot stronger.

But, Linus Torvalds disagreed. He mentioned that he had been "playing around" with a directory cache (dcache) patch that uses a fixed-size hash table as an L1 cache for directory entries that provided a noticeable performance boost. If a lookup in that first hash table fails, the code then falls back to the existing dynamically sized hash table. The reason that the code hasn't been committed yet is because "filling of the small L1 hash is racy for me right now" and he has not yet found a lockless and race-free way to do so. So:

[...] what I really wanted to bring up was the fact that static hash tables of a fixed size are really quite noticeably faster. So I would say that Sasha's patch to make *that* case easy actually sounds nice, rather than making some more complicated case that is fundamentally slower and more complicated.

Torvalds posted his patch after a request from Josh Triplett. The race condition is "almost entirely theoretical", he said, so the patch could be used to generate some preliminary performance numbers. Beyond just using the small fixed-size table, Torvalds's patch also avoids chaining entirely; if the hash bucket doesn't contain the entry, the second cache is consulted. By avoiding "pointer chasing", the L1 dcache "really improved performance".

Torvalds's dcache work is, of course, something of an aside in terms of Levin's patches, but several kernel developers seemed favorably inclined toward consolidating the various kernel hash table implementations. Biederman was unimpressed with the conversion of the UID cache in the user namespace code and Nacked it. On the other hand, Mathieu Desnoyers had only minor comments on the conversion of the tracepoint hash table and Eric Dumazet had mostly stylistic comments on the conversion of the 9p protocol error table. There are several other maintainers who have not yet weighed in, but so far most of the reaction has been positive. Levin is trying to attract more reviews by converting a few subsystems, as he notes in the patch.

It is still a fair amount of work to convert the other 40+ implementations, but the conversion seems fairly straightforward. But, Biederman's complaint about the conversion of the namespace code is something to note: "I don't have the time for a new improved better hash table that makes the code buggier." Levin will need to prove that his implementation works well, and that the conversions don't introduce regressions, before there is any chance that we will see it in the mainline. There is no reason that all hash tables need to be converted before that happens—though it might make it more likely to go in.


Ask a kernel developer

August 8, 2012

This article was contributed by Greg Kroah-Hartman.

Here is another in our series of articles with questions posed to a kernel developer. If you have unanswered questions about technical or procedural things involving Linux kernel development, ask them in the comment section, or email them directly to the author. This time, we look at UEFI booting, real-time kernels, driver configuration, and building kernels.

I’d like to follow a mailing list on UEFI-booting-related topics, but I can't seem to find any specific subsystem in the MAINTAINERS file. Would you please share some pointers?

Because of the wide range of topics involved in UEFI booting, there is no "one specific" mailing list where you can track just the UEFI issues. I recommend filtering the fast-moving linux-kernel mailing list, as most of the topics that kernel developers discuss cross that list. As the kernel isn't directly involved in UEFI, there is no one specific "maintainer" of this area at the moment. That being said, there are lots of different people working on this task right now.

From the kernel side itself, there has been some wonderful work from Matt Fleming and other Intel developers in making it possible for the kernel to be built as an image that is bootable directly from EFI. There were some recent patches that went into the 3.6-rc1 kernel that have made it easier for bootloaders to load the kernel in EFI mode. See the patch for the details of how this is done, but note that some bootloader work is also needed to take advantage of it.

From the "secure boot" UEFI mode side, James Bottomley, chair of the Technical Advisory Board of the Linux Foundation (and kernel SCSI subsystem maintainer), has been working through a lot of the "how do you get a distribution to boot in secure mode" effort and documenting it all for all distributions to use. He's published his results, with code; I also recommend reading his previous blog posts about this topic for more information about the subject and how it pertains to Linux.

As for distribution-specific work, both Canonical and Red Hat have been working with the UEFI Forum to help make Linux work properly on UEFI-enabled machines. I recommend asking those companies about how they plan to handle this issue, on their respective mailing lists, if you are interested in finding out what they are planning to do. Other distributions are aware of the issue, but as of this point in time, I do not believe they are working with the UEFI Forum.

I am evaluating Linux for use as an operating system in a real-time embedded application; however, I find it hard to find recent data on the real-time performance of Linux. Do you have, or know of someone who has, information on the real-time performance of the Linux kernel, preferably under various load conditions?

I get this type of question a lot, in various forms. The very simple answer is: "No, there is no data, you should evaluate it yourself on your hardware platform, with your system loads, to determine if it meets your requirements." And in reality, that's what you should be doing in the first place even if there were "numbers" published anywhere. Don't trust a vendor, or a project, to know exactly how you are going to be using the operating system. Only you know best, so only you know how to determine if it solves your problem or not.

So, go forth, download the code, run it, and see if it works. It's really that simple.

Note, if it doesn't work for you, let the developers know about it. If they don't know about any problems, then they can't fix them.

What is the best way to get configuration data into a driver? (This is paraphrased from many different questions all asking almost the same thing.)

In the past (i.e. 10+ years ago), lots of developers used module parameters in order to pass configuration options into a driver to control a device. That started to break down very quickly when multiple devices of the same type were in the same system, as there isn't a simple way to use module parameters for this.

When the sysfs filesystem was created, lots of developers started using it to help configure devices, as the individual devices controlled by a single driver are much easier to see and write values to. This works today, for simple sets of configuration options (such as calibrating an input device). But, for more complex types of configurations, the best thing to use is configfs (kernel documentation, LWN article), which was written specifically for this task. It handles ways to tie configurations to sysfs devices easily, and handles notifying drivers when things have been changed by the user. At this point in time, I strongly recommend using that interface for any reasonably complex configuration task that a driver or subsystem might need.

What is a good, fast, and reliable way to compile a custom kernel for a system? In the past, people have used lspci, lsusb, and others, combined with the old autokernelconf tool, but that can be difficult. Is there a better way?

As Linus pointed out a few weeks ago, configuring a kernel is getting more and more complex, with different options being needed by different distributions. The simplest way I have found to get a custom kernel up and running on a machine is to take a distribution-built kernel that you know works, and then use the "make localmodconfig" build option.

To use this option, first boot the distribution kernel, and plug in any devices that you expect to use on the system, which will load the kernel drivers for them. Then go into your kernel source directory, and run "make localmodconfig". That option will dig through your system and find the kernel configuration for the running kernel (which is usually at /proc/config.gz, but can sometimes be located in the boot partition, depending on the distribution). Then, the script will remove all options for kernel modules that are not currently loaded, stripping down the number of drivers that will be built significantly. The resulting configuration file will be written to the .config file, and then you can build the kernel and install it as normal. The time to build this stripped-down kernel should be very short, compared to the full configuration that the distribution provides.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.6-rc1
Steven Rostedt 3.4.7-rt15
Ben Hutchings Linux 3.2.26
Steven Rostedt 3.2.26-rt39
Ben Hutchings Linux 3.2.25
Steven Rostedt 3.2.24-rt38
Steven Rostedt 3.0.39-rt59

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Miscellaneous

Stephen Hemminger iproute2 3.5.0

Page editor: Jake Edge


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds