Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.26-rc6, released by Linus on June 12. "I'd like to say that the diffs are shrinking and things are calming down, but I'd be lying. Another week, another -rc, and another 350 commits." See the long-format changelog for all the details.
As of this writing, some 140 commits have gone into the mainline git repository since the 2.6.26-rc6 release. They include a number of fixes and a new driver for FM3130 realtime clocks.
The current -mm tree is 2.6.26-rc5-mm3. Says Andrew: "The aim here is to get all the stupid bugs out of the way so that some serious MM testing can be performed." Among other things, this release contains the latest version of the pageout scalability patches (see below).
The current stable 2.6 kernel is 2.6.25.7, released on June 16. It contains a rather long list of important fixes.
Kernel development news
Why some drivers are not merged early
Arjan van de Ven's kernel oops report always makes for interesting reading; it is a quick summary of what is making the most kernels crash over the past week. It thus points to where some of the most urgent bugs are to be found. Sometimes, though, this report can raise larger issues as well. Consider the June 16 report, which notes that quite a few kernel crashes were the result of a not-quite-ready wireless update shipped by Fedora. Ingo Molnar was quick to jump on this report with a process-related complaint:
Same for Nouveau: Fedora carries it and i dont understand why such a major piece of work is not done in mainline and not _helped by_ mainline.
He then took the discussion further with this observation:
This comment drew some unhappy responses from the networking developers, who feel that they have been unfairly targeted for criticism. Wireless drivers have been merged at the first real opportunity, they say, and trying to put them in earlier would have only made things worse. In fact, your editor will submit that mistakes were made with wireless drivers, but those mistakes have little to do with delaying their inclusion into the mainline. What went wrong with wireless is this:
- Early wireless developers did not really try to solve the wireless networking problem; they just wanted to get their adaptor to work. Wireless maintainer John Linville once told your editor that, for years, these adaptors were treated as if they were Ethernet adaptors, which they certainly are not. When these developers did get around to dealing with issues specific to wireless networking, they created their own wireless stacks contained within their drivers. So no general wireless framework was created.
- It was only in 2004 that Jeff Garzik started a project to create a generic wireless stack for Linux - and he started with a stack (HostAP) which, some time later, was seen as not being the best choice. So the work on HostAP - late to begin in the first place - was eventually abandoned.
- The networking stack which was eventually developed - mac80211 - began its life as a proprietary code base created with no community review or oversight at all. Predictably, it had all kinds of problems which required well over a year of work to resolve. Until mac80211 was in reasonable shape, there was no real way to get drivers ready for inclusion.
The result of all this (and the occasional legal hassle as well) is that wireless networking on Linux lagged for years, and is only now reaching something close to a stable state. So it is not surprising that there has been a lot of code churn in this area, or that things occasionally break. But it is hard to see how trying to merge wireless drivers sooner would have helped the situation significantly.
The non-merging of the Nouveau driver - the reverse-engineered driver for NVIDIA adapters - also has a simple explanation: the developers have not yet asked for this merge to happen. Nouveau is not considered to be at a point where it works yet, and, importantly, there are still user-space API issues which must be worked out. Breaking user-space code is severely frowned upon, so merging of code is nearly impossible if its user-space interfaces are still in flux.
James Bottomley put forward another reason why a driver may stay out of the mainline even though the author would like to see it merged:
In other words, their control over access to the mainline tree is the one club subsystem maintainers have at hand when they feel the need to push a developer to make changes to a driver. It may well be that simply merging drivers regardless of technical objections (something which a number of developers are pushing for) will reduce the incentive for developers to get their code into top shape - and it's not always clear that others will step in and do the work for them.
On the other hand, the idea that in-tree code tends to be less buggy than out-of-tree code is relatively uncontroversial. So, for many drivers at least, a "merge first and fix it up later" policy may well lead to the best results in the shortest period of time. One thing that is clear is that this discussion will not be going away anytime soon; chances are good that this year's kernel summit (happening in September) will end up revisiting the issue.
Peter Zijlstra: From DOS to kernel hacking
In a linux-kernel thread about fixing the Kernel Janitors project, Peter Zijlstra spoke up with a bit of his perspective on attracting better kernel contributors. As he is a relatively recent addition to the kernel community, his path from Linux user to kernel hacker may serve as a template of sorts for others who are starting out now. We asked Peter to answer a few questions by email to help fill in some more of the details.
LWN: How did you get started with Linux? What attracted you?
A friend of mine introduced me to Unix/Linux at the time, and I started learning all about programming in a real environment. Basically all programming up to that point was in a freestanding environment where you had to poke the hardware to get anything done.
So initially it was the charm of a proper multitasking OS (with memory protection) that got me to use it – not having to reboot your machine every time, and the luxury of being able to run a debugger.
LWN: How quickly did you start poking around in the kernel? What did you first start to look at and why?
In those 10 years I learnt a lot about programming. I learnt about Unix system programming, I learnt about C++, multi-threading, database engines, and a whole range of interesting things.
Somewhere along I got a real internet connection and started lurking on mailing lists, including LKML – I must have been reading that on and off for about 5 years by the time I really sat down and wrote some patches.
During that time I might have sent in some trivial build fixes, and I remember finding a priority leak in one of the realtime patches. But I wasn't actively coding on the kernel – I just liked running real exotic stuff, you know Gentoo and building just about everything from CVS.
So what got me started on the kernel ... I can't quite remember how it happened, but I ran into some of Rik's [van Riel] Advanced Page Replacement stuff. I had worked on that problem space earlier while doing database engines, and had recently run into it again at work. So I started reading those papers and some of the proposed kernel patches, and I started to itch.
I dropped basically everything I was working on in my spare time (hacking WindowMaker, writing a C++ ASN.1-DER serialization class, writing a new LDAP server and I'm sure some other projects that are rotting away on a harddrive somewhere :-) and started hacking.
Why ... I'm not sure – it sure got me back to where I started out – crashing machines (and boot times haven't improved over those past 10 years at all).
I think because of the challenge – I knew I could write whatever it was I was coding and this page replacement stuff was a whole new challenge, and TBH [to be honest] the kernel code didn't look too hard at the time (phew how ignorant I was..)
LWN: How well were your contributions received by kernel hackers? Did you make any missteps along the way?
Mis-steps, feh, still do ;-) Unlike most people seem to think, kernel hackers are human too.
LWN: What suggestions do you have for folks that are looking at getting involved in kernel hacking today?
- take it and act upon it
- convince the other he's wrong
OK it can get personal, but that is only if you repeatedly fail the above two points.
LWN: There has been a lot of talk about the Kernel Janitors project recently, do you think that is a good way to get started with kernel development? What do you think should be done differently in that (or other) project(s) to attract more and better contributors?
- we don't have enough simple but interesting things lined up (not saying there are none, but we don't have a ready list). I think a proper challenging project would be much better than moronic code cleanups.
- the kernel really isn't a place for newbies; now let me explain this before it gets all mis-interpreted :-)
- Things really get a lot easier if you're fairly competent at (Unix) system programming before starting at the kernel.
- Kernel hacking is a solitary business in that you need to do things; nobody is going to do them for you. That is not saying nobody can help you if you have a question. Also, nobody is going to force you to do something – you need to want to do it.
So I guess what I'm saying is that you need to really want to do it. There is no other way to become a kernel hacker than by simply doing it.
LWN: Do you work on Linux for your job, as a hobby, or both?
So I applied for a kernel position at a few of the larger vendors, and Red Hat won the race.
Already having had a year's worth of exposure to kernel code and LKML certainly helped in getting this amazing opportunity. Have I already mentioned I absolutely love working on the kernel?
So now I get to poke at the kernel all day, every day...
LWN: What are your current kernel projects? What kinds of things do you see yourself doing in the kernel in the future?
The future ... well we'll see what happens, loads of interesting stuff to do.
We would like to thank Peter for taking the time to answer our questions.
The state of the pageout scalability patches
The virtual memory scalability improvement patch set overseen by Rik van Riel has been under construction for well over a year; LWN last looked at it in November, 2007. Since then, a number of new features have been added and the patch set, as a whole, has gotten closer to the point where it can be considered for mainline inclusion. So another look would appear to be in order.

One of the core changes in this patch set remains the same: it still separates the least-recently-used (LRU) lists for pages backed up by files and those backed up by swap. When memory gets tight, it is generally preferable to evict page cache pages (those backed up by files) rather than anonymous memory. File-backed pages are less likely to need to be written back to disk and they are more likely to be well laid out on disk, making it quicker to read them back in if necessary. Current Linux kernels keep both types of pages on the same LRU list, though, forcing the pageout code to scan over (potentially large numbers of) pages which it is not interested in evicting. Rik's patch improves this situation by splitting the LRU list in two, allowing the pageout code to look only at pages which might actually be candidates for eviction.
There comes a point, though, where anonymous pages need to be reclaimed as well. The kernel will make an effort to pick the best pages to evict by going for those which have not been recently referenced. Doing that, however, requires going through the entire list of anonymous pages, clearing the "referenced" bit on each. A large system can have many millions of anonymous pages; iterating over the entire set can take a long time. And, as it turns out, it's not really necessary.
The VM scalability patch set now changes that behavior by simply keeping a certain percentage of the system's anonymous pages on the inactive list - the first place the system looks for pages to evict. Those pages will drift toward the front of the list over time, but will be returned to the active list if they are used. Essentially, this patch is applying a form of the "referenced" test to a portion of anonymous memory - whether or not anonymous pages are being evicted at the time - rather than trying to check the referenced state of all anonymous pages when the kernel decides it needs to reclaim some of them.
Another set of patches addresses a different situation: pages which cannot be evicted at all. These pages might have been locked into memory with a system call like mlock(), be part of a locked SYSV shared memory region, or be part of a RAM disk, for example. They can be either page cache or anonymous pages. Either way, there is little point in having the reclaim code scan them, since it will not be possible to evict them. But, of course, the current reclaim code does have to scan over these pages.
This unneeded scanning, as it turns out, can be a problem. The extensive unevictable LRU document included with the patch claims:
Most of us are not currently working with systems of this size; one must spend a fair amount of money to gain the benefits of this sort of pathological behavior. Still, it seems like something which is worth fixing.
The solution, of course, is yet another list. When a page is determined to be unevictable, that page will go onto the special, per-zone unevictable list, after which the pageout code will simply not see it anymore. As a result of the variety of ways in which a page can become unevictable, the kernel will not always know at mapping time whether a specific page can go onto the unevictable list or not. So the pageout code must keep an eye out for those pages as it scans for reclaim candidates and shunt them over to the unevictable list as they are found. In relatively short order, the locked-down pages will accumulate in this list, freeing the pageout code to concentrate on pages it can actually do something about.
Many of the concerns which have been raised about this patch set over the last year have been addressed. A few remain, though. Some of the new features require new page flags; these flags are in extremely short supply, so there is always pressure to find ways of implementing things which do not allocate more of them. There are a few too many configuration options and associated #ifdef blocks. And so on. Addressing these may take a while, but convincing everybody that these (rather fundamental) memory management changes are beneficial under all circumstances may take rather longer. So, while this patch set is making progress, a 2.6.27 merge is probably not in the cards.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet