Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.34-rc6, released on April 29. This prepatch includes a lot of fixes, supplemented by the VMware balloon driver (discussed briefly here in early April) and the ipheth driver, which facilitates USB tethering to iPhones. The short-form changelog is in the announcement, or see the full changelog for the details.

Stable updates have been nonexistent over the last week.

Quotes of the week

If you were using two processes then I'd cheerily blame the scheduler. Because blaming the scheduler for WeirdShitWhichBroke is usually correct.
-- Andrew Morton

The Red Hat Enterprise Linux 6 kernel includes numerous subsystems and enhancements from 2.6.34, as well as its predecessor versions. As a result, the Red Hat Enterprise Linux 6 kernel cannot be simply labeled as any particular upstream version. Rather, the Red Hat Enterprise Linux 6 kernel is a hybrid of the latest several kernel versions. And, as Red Hat provides regular updates over the lifecycle of the product, we expect that the Red Hat Enterprise Linux 6 kernel will incorporate selected features from future upstream kernels that have yet to be developed.
-- Red Hat Enterprise Linux Team

My problem is I'm incredibly busy at the moment, and I've already done Ubuntu a huge favor by spending ten minutes to do a quickie investigation. Ubuntu needs to learn that it can't rely on upstream developers to jump through flaming hoops on short notice before a LTS release deadline as a cost-saving mechanism to avoid hiring their own senior kernel engineers.
-- Ted Ts'o

Talk about high level designs rarely gets any traction, and often goes nowhere. Give us an example implementation so there is something concrete for us to sink our teeth into.
-- David Miller

A rough restart for checkpoints

By Jonathan Corbet
May 5, 2010
Back in February, the checkpoint/restart patch set was brought to the kernel mailing list with a request for inclusion in the -mm tree. That was immediately prior to the 2.6.34 merge window, so there were limited amounts of developer attention available for review. At that time, Andrew Morton suggested:

I'd suggest waiting until very shortly after 2.6.34-rc1 then please send all the patches onto the list and let's get to work.

The checkpoint/restart developers did post the patches in March, to relatively little response. Shortly before the 2.6.35 merge window, they reposted the whole thing as a 100-patch series. Unsurprisingly, there have been some complaints about the massive mailing, but there is another outcome which is less fortunate: the patches are not being looked at.

That, too, is unsurprising. The amount of developer time available for patch review is insufficient in the best of times, and it gets worse as the merge window approaches. Even the most seasoned reviewer is going to be a bit intimidated by a 100-patch series which pokes its fingers into almost every part of the core kernel. Most of them will decide that they have more important things to do elsewhere.

So, once again, checkpoint/restart is likely to be put on hold until after the next merge window. After that, if it comes back in more manageable pieces, the developers might truly get to work.

De-bloating tracepoints

By Jonathan Corbet
May 5, 2010
Support for tracing in the Linux kernel has made great strides over the last couple of years. One of the key features of a mature tracing system, though, is a long list of well-defined, well-documented tracepoints which allow a system administrator to hook into kernel events without understanding the kernel code itself. The kernel has slowly been gaining those tracepoints, but, as Steven Rostedt has pointed out, there is a problem: each tracepoint adds something between 1KB and 5KB to the size of the kernel. When one starts to think about adding hundreds (or more) tracepoints, that overhead starts to add up.

Steven, of course, is as good a person as any to blame for this problem, so he has set out to fix it. His nine-part patch moves some information to shared locations and eliminates unneeded data; the result was a 100KB reduction in the size of his kernel. Needless to say, this seems like a savings worth having; it makes it that much more likely that tracepoints will actually be enabled in production kernels.

Of course, most of us will have to take Steven's word for it that the patches make sense; they are written in that special dialect of C preprocessor macros that mere kernel hackers fear to touch. So most of us are likely to take the memory savings, but won't look too closely at how they are achieved.

Kernel development news

Cleancache and Frontswap

By Jonathan Corbet
May 4, 2010
Dan Magenheimer's transcendent memory patch was examined here last July. This patch creates a special class of memory which is not directly accessible to the rest of the kernel, allowing a number of special tricks to be played. Since then, transcendent memory has seemingly disappeared from view - until now, at least. Dan has returned with a pair of new abstractions - called "Cleancache" and "Frontswap" - each of which encapsulates a part of what transcendent memory does.

Cleancache is the less controversial of the two. Dan describes it as "a page-granularity victim cache for clean pages," which should be crystal-clear to most LWN readers. For those who need a few more words: Cleancache provides a place where the kernel can put pages which it can afford to lose, but which it would like to keep around if possible. A classic example is file-backed pages which are clean, so they can be recovered from disk if need be. The kernel can drop such pages with no data loss, but things will get slower if the page is needed in the near future and must be read back from disk.

In such situations, the kernel could, instead of dropping the page, put it into the Cleancache system with:

    int cleancache_put_page(struct page *page);

At some future point, if there is a need for the page, it can be retrieved with:

    int cleancache_get_page(struct page *page);

The key point is that there is never any guarantee that cleancache_get_page() will actually succeed in getting the page back. The Cleancache code (or whatever mechanism sits behind it) is free to drop the page at any time if it needs the memory for some other purpose. So Cleancache users must be prepared to fall back to the real backing store if cleancache_get_page() fails.
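The fallback rule described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: the `toy_` names, the two-slot cache, and the round-robin eviction are all assumptions made for illustration, and the real Cleancache backend is free to drop pages by any policy it likes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_PAGE_SIZE 64   /* small stand-in for a real page size */
#define TOY_SLOTS     2    /* deliberately tiny so that evictions happen */

/* Hypothetical stand-in for struct page: a file offset plus its contents. */
struct toy_page {
    unsigned long index;
    char data[TOY_PAGE_SIZE];
};

static struct toy_page toy_slots[TOY_SLOTS];
static int toy_slot_used[TOY_SLOTS];
static int toy_next_victim;

/* Store a clean page; an older page may be silently overwritten. */
static int toy_cleancache_put(const struct toy_page *page)
{
    assert(page != NULL);
    toy_slots[toy_next_victim] = *page;
    toy_slot_used[toy_next_victim] = 1;
    toy_next_victim = (toy_next_victim + 1) % TOY_SLOTS;
    return 0;
}

/* Return 0 and fill page->data on a hit, -1 if the page was dropped. */
static int toy_cleancache_get(struct toy_page *page)
{
    assert(page != NULL);
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (toy_slot_used[i] && toy_slots[i].index == page->index) {
            memcpy(page->data, toy_slots[i].data, TOY_PAGE_SIZE);
            return 0;
        }
    }
    return -1;
}

/* Slow path: regenerate the contents as if reading them from disk. */
static void toy_read_from_disk(struct toy_page *page)
{
    snprintf(page->data, TOY_PAGE_SIZE, "block %lu", page->index);
}

/* Try the cache first, then fall back. Returns 1 on a hit, 0 on a miss. */
static int toy_read_page(struct toy_page *page)
{
    if (toy_cleancache_get(page) == 0)
        return 1;
    toy_read_from_disk(page);
    return 0;
}
```

In this model a caller never needs to know whether the cache kept its page: `toy_read_page()` yields the same contents either way, and only the cost differs.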

While Cleancache holds the page, it can do creative things with it. Pages with duplicate contents are not uncommon, especially in virtualized situations; often, significant numbers of pages contain only zeroes. The backing store behind Cleancache can detect those duplicates and store a single copy. Compression of stored pages is also possible; there is currently work afoot to implement ramzswap (CompCache) as a Cleancache backend. It might also be possible to use Cleancache as part of a solid-state cache in front of a normal rotating drive.

Dan's patches include the addition of hooks to commonly-used filesystems so that they will use Cleancache automatically.

The other half of the equation is Frontswap; unlike Cleancache, Frontswap is meant to deal with dirty pages that the kernel would like to get rid of. Once again, there is an interface for moving pages into and out of the system:

    int frontswap_put_page(struct page *page);
    int frontswap_get_page(struct page *page);

The rules are a bit different, though: Frontswap is not required to accept pages handed to it (so frontswap_put_page() can fail), but every page it accepts is guaranteed to be there later when the kernel asks to get it back.
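The inverted guarantee can be sketched in the same style. Again this is a user-space model under stated assumptions, not the kernel implementation: the `toy_` names and the fixed two-page pool are hypothetical, and having a get release its slot is a simplification chosen here rather than something the interface above specifies.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_POOL      2     /* pages the hypothetical backing pool can hold */

/* Hypothetical stand-in for struct page: a swap offset plus its contents. */
struct toy_page {
    unsigned long offset;
    char data[TOY_PAGE_SIZE];
};

static struct toy_page toy_pool[TOY_POOL];
static int toy_pool_used[TOY_POOL];

/* May refuse: returns 0 if the page was accepted, -1 if the pool is full. */
static int toy_frontswap_put(const struct toy_page *page)
{
    assert(page != NULL);
    for (int i = 0; i < TOY_POOL; i++) {
        if (!toy_pool_used[i]) {
            toy_pool[i] = *page;
            toy_pool_used[i] = 1;
            return 0;
        }
    }
    return -1;
}

/* Accepted pages are never dropped, so this fails only for unknown pages. */
static int toy_frontswap_get(struct toy_page *page)
{
    assert(page != NULL);
    for (int i = 0; i < TOY_POOL; i++) {
        if (toy_pool_used[i] && toy_pool[i].offset == page->offset) {
            memcpy(page->data, toy_pool[i].data, TOY_PAGE_SIZE);
            toy_pool_used[i] = 0;   /* toy simplification: a get frees the slot */
            return 0;
        }
    }
    return -1;
}

/* Swap-out path: returns 1 if the pool took the page, 0 if it was refused. */
static int toy_swap_out(const struct toy_page *page)
{
    if (toy_frontswap_put(page) == 0)
        return 1;
    /* A real caller would now write the dirty page to the swap device. */
    return 0;
}
```

The contrast with the Cleancache model is that the failure case moves from the get side to the put side: the caller must check the result of the put, but can rely on the get for any page that was accepted.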

Like Cleancache, Frontswap can play tricks with the stored pages to stretch its memory resources. The real purpose behind this mechanism, though, appears to be to enable a hypervisor to respond quickly to memory usage spikes in virtualized guests. Dan put it this way:

Frontswap serves nicely as an emergency safety valve when a guest has given up (too) much of its memory via ballooning but unexpectedly has an urgent need that can't be serviced quickly enough by the balloon driver.

Reviewers have been more skeptical of this mechanism. To some, it looks like a way of dealing with shortcomings in the balloon driver, which is already charged with implementing hypervisor decisions on how much memory is to be made available to guests. If that is the case, it seems like fixing the balloon driver might be the better approach. Dan's response is that balloon drivers cannot respond quickly to memory needs, and that regulating guest memory with a balloon driver can lead to swap storms. This is, apparently, a real problem encountered by virtualized systems in the field. If, instead, the hypervisor maintains a pool of pages for Frontswap, it can make them available quickly when the need arises, mitigating memory-related performance problems.

Beyond that, Avi Kivity complains that memory given to guests with Frontswap can never be recovered by the hypervisor if those guests choose to hang onto it. Since operating systems tend to be written to take advantage of all of the memory resources available to them, it seems possible that Frontswap memory could fill quickly and would stay full, leaving the hypervisor starving for memory while maintaining pages it cannot get rid of. Avi also dislikes the page-at-a-time, synchronous nature of the Frontswap API. Dan's response here is that per-guest quotas will keep any guest from using too much Frontswap space and that the API is better suited to the problem being solved.

Complaints notwithstanding, Cleancache and Frontswap already appear to be in reasonably wide use; they are shipping in OpenSUSE 11.2, Oracle's VM virtualization product, and with Xen. Such distribution certainly stretches the "upstream first" rule somewhat, but it also shows that there is apparently a real use case for these features. Given that the patches are not particularly intrusive and that the features have no cost if they are not used, it seems that something along these lines should make it into the mainline sooner or later.

Reworking pm_qos

By Jonathan Corbet
May 4, 2010
Aggressive power management is increasingly used to reduce the power requirements of our systems. Sometimes, though, power management can, through the creation of excessive latencies, get in the way of work which needs to be done. One way to avoid problems is to have latency-sensitive parts of the kernel express their requirements, which can then be taken into account by the power management code. Tracking these requirements is the task of the pm_qos ("power management quality of service") code. Chances are that pm_qos will see a significant API change in 2.6.35.

The pm_qos code currently defines three quality of service parameters for which requirements may be specified: CPU latency (PM_QOS_CPU_DMA_LATENCY), network response latency (PM_QOS_NETWORK_LATENCY), and network throughput (PM_QOS_NETWORK_THROUGHPUT). The first two are specified in microseconds; throughput is specified in KB/sec. Currently, CPU latency requirements are observed by the cpuidle subsystem, and network latency is observed only by the mac80211 layer. Any requests for a minimum network throughput will fall on deaf ears in current kernels; given the effectiveness of asking your editor's ISP for better service, one assumes that the ignoring of throughput requests is simply a clever elimination of useless work by the networking hackers.

The API for specifying quality of service parameters is:

    #include <linux/pm_qos_params.h>

    int pm_qos_add_requirement(int qos, char *name, s32 value);
    int pm_qos_update_requirement(int qos, char *name, s32 value);
    void pm_qos_remove_requirement(int qos, char *name);

For each of the above functions, qos is one of the parameters listed above, name identifies the subsystem specifying the requirement, and value is the new requirement. The name string is used to identify a specific request in pm_qos_update_requirement() and pm_qos_remove_requirement(); it must match the value given when the requirement was first added.

Kernel code which may make decisions affecting quality of service should pay attention to the current requirements. There are two ways of doing that; the first is simply to ask pm_qos for the tightest requirement currently in effect:

    int pm_qos_requirement(int qos);

The alternative is to register a notifier which is called whenever a given requirement changes, using:

    int pm_qos_add_notifier(int qos, struct notifier_block *notifier);
    int pm_qos_remove_notifier(int qos, struct notifier_block *notifier);

This API has been around for some time, though it remains lightly used within the kernel. One complaint which has been made is that the use of strings to identify requirements leads to inefficient behavior: changing a requirement involves walking a list and doing a bunch of string comparisons. Requirements are, by their nature, specified by latency-sensitive code, so it makes sense that the process should be fast. The use of arbitrary strings also opens up a distant possibility of confusion should two developers accidentally choose the same name.
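The cost being complained about can be made concrete with a user-space model of name-based lookup. This is a sketch under stated assumptions, not the pm_qos implementation: the `toy_` names, the fixed-size table, and the comparison counter are hypothetical and exist only to make the work visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_REQS 8

/* Hypothetical record for one named requirement. */
struct toy_requirement {
    const char *name;   /* NULL marks an unused slot */
    int value;
};

static struct toy_requirement toy_reqs[TOY_MAX_REQS];
static unsigned long toy_strcmp_calls;   /* instrumentation for this sketch */

/* Record a new named requirement; returns -1 if the table is full. */
static int toy_add_requirement(const char *name, int value)
{
    assert(name != NULL);
    for (int i = 0; i < TOY_MAX_REQS; i++) {
        if (toy_reqs[i].name == NULL) {
            toy_reqs[i].name = name;
            toy_reqs[i].value = value;
            return 0;
        }
    }
    return -1;
}

/* Update by name: walk the table, comparing strings until one matches. */
static int toy_update_requirement(const char *name, int value)
{
    assert(name != NULL);
    for (int i = 0; i < TOY_MAX_REQS; i++) {
        if (toy_reqs[i].name == NULL)
            continue;
        toy_strcmp_calls++;
        if (strcmp(toy_reqs[i].name, name) == 0) {
            toy_reqs[i].value = value;
            return 0;
        }
    }
    return -1;   /* no requirement with that name */
}
```

Every update pays for one string comparison per entry ahead of the match. The model also shows the naming hazard: if two callers register the same name, an update only ever reaches the first entry.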

In response to these problems, pm_qos hacker Mark Gross has proposed some changes to the API. With the new version, "requirements" would become "requests," and the use of strings to identify them would be removed. The new API for the specification of requests is:

    struct pm_qos_request_list *pm_qos_add_request(int qos, s32 value);
    void pm_qos_update_request(struct pm_qos_request_list *pm_qos_req,
                               s32 new_value);
    void pm_qos_remove_request(struct pm_qos_request_list *pm_qos_req);

The pm_qos_request_list structure type is opaque to callers; it serves only as a handle to identify a specific request. Changes and removals can now be done with no list traversals and no string comparisons. On the other side, pm_qos_requirement() becomes pm_qos_request(), but the API is otherwise unchanged.
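A handle-based scheme of this kind can be modeled in user space as follows. The sketch makes several assumptions: the `toy_` names and fixed-size table are hypothetical, the parameter is treated as a latency bound so that the tightest request is the smallest one, and the aggregate is recomputed on each query for simplicity. None of this is taken from the proposed patch itself.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_REQUESTS 8

/* Hypothetical stand-in for the opaque handle returned to callers. */
struct toy_qos_req {
    int value;     /* requested latency bound, e.g. in microseconds */
    int active;
};

static struct toy_qos_req toy_requests[TOY_MAX_REQUESTS];

/* Register a request; returns a handle, or NULL if the table is full. */
static struct toy_qos_req *toy_qos_add_request(int value)
{
    for (int i = 0; i < TOY_MAX_REQUESTS; i++) {
        if (!toy_requests[i].active) {
            toy_requests[i].value = value;
            toy_requests[i].active = 1;
            return &toy_requests[i];
        }
    }
    return NULL;
}

/* Update through the handle: no search and no string comparison. */
static void toy_qos_update_request(struct toy_qos_req *req, int new_value)
{
    assert(req != NULL && req->active);
    req->value = new_value;
}

/* Drop a request; its slot becomes available for reuse. */
static void toy_qos_remove_request(struct toy_qos_req *req)
{
    assert(req != NULL && req->active);
    req->active = 0;
}

/* Tightest latency bound in effect: the minimum over active requests. */
static int toy_qos_current(void)
{
    int tightest = INT_MAX;    /* assumed "no constraint" default */

    for (int i = 0; i < TOY_MAX_REQUESTS; i++) {
        if (toy_requests[i].active && toy_requests[i].value < tightest)
            tightest = toy_requests[i].value;
    }
    return tightest;
}
```

Because the caller holds a pointer to its own request, updates and removals touch exactly one entry, and two subsystems can never collide on an identifier.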

This change seems uncontroversial, and it should address the criticisms which have been made against this API. Unless something surprising happens, the new API will probably be merged for 2.6.35.

Kernel development statistics for 2.6.34 and beyond

By Jonathan Corbet
May 4, 2010
As of this writing, the current kernel prepatch is 2.6.34-rc6. A couple more prepatches are most likely due before the final release, but the number of changes to be found there should be small. In other words, 2.6.34 is close to its final form, so it makes sense to take a look at what has gone into this development cycle. In a few ways, 2.6.34 is an unusual kernel.

This kernel has seen the addition of 9100 non-merge changesets from just over 1100 developers. That makes it somewhat smaller than its predecessors, as can be seen in this table:

Kernel    Patches   Devs
2.6.29    11,600    1170
2.6.30    11,700    1130
2.6.31    10,600    1150
2.6.32    10,800    1230
2.6.33    10,500    1150
2.6.34     9,100    1110

Developer participation in this development cycle was slightly lower than usual, but not in any significant way. But, it seems, those developers had a bit less than usual that they needed to get done. One might be tempted to chalk that up to the shorter-than-usual merge window at the beginning of this cycle, but the fact of the matter is that Linus let enough new material in after 2.6.34-rc1 to make the merge window effectively as long as it ever was.

The lists of the most active developers suggest that perhaps something else was going on: many of the developers who traditionally put large amounts of code into the kernel essentially sat out this cycle.

Most active 2.6.34 developers

By changesets
Sage Weil                    212   2.3%
Joe Perches                  169   1.9%
Paul Mundt                   153   1.7%
Uwe Kleine-König             109   1.2%
Mark Brown                   102   1.1%
Ben Dooks                     96   1.1%
Rafał Miłecki                 88   1.0%
Dan Carpenter                 84   0.9%
Alex Deucher                  83   0.9%
H Hartley Sweeten             80   0.9%
Christoph Hellwig             75   0.8%
Johannes Berg                 74   0.8%
Arnaldo Carvalho de Melo      72   0.8%
Bartlomiej Zolnierkiewicz     64   0.7%
David S. Miller               63   0.7%
Magnus Damm                   63   0.7%

By changed lines
Sage Weil                  30233   4.1%
Vladislav Zolotarov        23119   3.2%
Jarod Wilson               19689   2.7%
Mark Brown                 18513   2.5%
Dimitris Michailidis       13919   1.9%
Manuel Lauss               11831   1.6%
Jörn Engel                 10810   1.5%
Kukjin Kim                 10142   1.4%
Alex Deucher                9785   1.3%
Amit Kumar Salecha          9391   1.3%
Michael Chan                9336   1.3%
Joe Perches                 8738   1.2%
Paul Mundt                  8438   1.2%
Haojian Zhuang              8403   1.1%
Magnus Damm                 8320   1.1%
Matthias Benesch            7739   1.1%

Sage Weil jumped to the top of both lists with the merger of the Ceph distributed filesystem and the subsequent bug-fixing activity. Joe Perches is the new king of the trivial patch; his work includes lots of checkpatch fixups, reworking print statements in network drivers, and no fewer than 37 patches implementing a rather belated cleanup of the floppy driver. Paul Mundt's work falls almost exclusively within his role as the maintainer of the Super-H architecture. Uwe Kleine-König works mostly within the ARM architecture code, and Mark Brown continues as the source of large amounts of sound driver and embedded processor code.

On the "lines changed" side, Vladislav Zolotarov only contributed nine patches, all with the Broadcom NetXtreme II driver - but they included a large replacement of the in-tree firmware. Jarod Wilson's count was even smaller - three patches; he contributed the Broadcom Crystal HD driver to the staging tree. Dimitris Michailidis earned his place on the list with the new Chelsio Communications T4 Ethernet driver.

Just over 180 employers were identified as having contributed to 2.6.34 - almost exactly the same as 2.6.33. With the 2.6.33 summary, your editor suggested that Red Hat's position as the top contributor may soon be threatened; let's see how that prediction worked out for 2.6.34:

Most active 2.6.34 employers

By changesets
(None)                      1455  16.0%
(Unknown)                    959  10.5%
Red Hat                      934  10.3%
Intel                        472   5.2%
IBM                          354   3.9%
Novell                       329   3.6%
(Consultant)                 274   3.0%
Nokia                        248   2.7%
New Dream Network            237   2.6%
Renesas Technology           188   2.1%
Texas Instruments            180   2.0%
Pengutronix                  154   1.7%
Oracle                       144   1.6%
HP                           128   1.4%
(Academia)                   125   1.4%
Analog Devices               123   1.4%
AMD                          121   1.3%
Fujitsu                      121   1.3%
Marvell                      120   1.3%
Wolfson Microelectronics     101   1.1%

By lines changed
Red Hat                    75235  10.3%
(None)                     75160  10.3%
(Unknown)                  67541   9.2%
Broadcom                   56595   7.7%
Intel                      33175   4.5%
New Dream Network          31501   4.3%
(Consultant)               29140   4.0%
Novell                     24217   3.3%
Wolfson Microelectronics   20660   2.8%
Renesas Technology         16205   2.2%
Chelsio                    13937   1.9%
IBM                        13618   1.9%
QLogic                     13182   1.8%
MSC Vertriebs GmbH         12545   1.7%
Samsung                    12224   1.7%
Marvell                    11914   1.6%
Texas Instruments          11228   1.5%
Analog Devices             11047   1.5%
AMD                        10894   1.5%
Nokia                      10217   1.4%

Looking at absolute numbers, Red Hat's contributions declined considerably from 2.6.33: 1223 changesets dropped to 934. Everybody else declined even further, though; Intel's changeset count was less than half of its value from 2.6.33. So Red Hat stays firmly at the top of the list. Many of the other companies on the list will be unsurprising, but readers may be forgiven for wondering about New Dream Network; that is a business co-founded by Ceph developer Sage Weil.

If we look at non-author signoffs, we get a view of who the most active gatekeepers for the kernel are. Here, there are no surprises at all:

Most non-author signoffs

By developer
David S. Miller             1034  13.0%
Greg Kroah-Hartman           780   9.8%
Andrew Morton                546   6.9%
John W. Linville             546   6.9%
Ingo Molnar                  348   4.4%
Mauro Carvalho Chehab        330   4.2%
James Bottomley              244   3.1%
Dave Airlie                  150   1.9%
Ralf Baechle                 144   1.8%
H. Peter Anvin               141   1.8%

By employer
Red Hat                     2865  36.1%
Novell                      1293  16.3%
Intel                        565   7.1%
Google                       547   6.9%
(None)                       365   4.6%
IBM                          289   3.6%
(Consultant)                 194   2.4%
Wind River                   145   1.8%
Atomide                      130   1.6%
Oracle                       128   1.6%

Ten development cycles ago (2.6.24), Andrew Morton was the most active gatekeeper, signing off on almost 1700 patches. His role as subsystem maintainer of last resort has declined over the years as more maintainers manage their own repositories and push patches directly to Linus. Speaking of Linus, he not only didn't make the list above, but he wasn't even close: his 71 signoffs put him in the 22nd position. Dave Airlie's position on the list is an indication of how much activity we are currently seeing in the graphics area.

Once again, over 50% of the patches heading into the mainline kernel pass through the hands of somebody employed by either Red Hat or Novell: the two account for 36.1% and 16.3% of non-author signoffs respectively, or 52.4% together.

Looking forward

As of this writing, the opening of the 2.6.35 merge window can be expected sometime in the next 1-3 weeks. By the stated rules of the kernel development process, the bulk of the code intended for that merge window should already be in the linux-next tree. With that in mind, your editor pulled down the May 4 edition of linux-next to see what was up. There are currently 5144 non-merge changesets in that tree, representing 758 developers. The top contributors are:

Most active linux-next developers

By changesets
Mauro Carvalho Chehab        245   4.8%
Eric Paris                   103   2.0%
Alexander Graf                84   1.6%
Johannes Berg                 59   1.1%
Juuso Oikarinen               59   1.1%
Jean-François Moine           58   1.1%
Luis R. Rodriguez             58   1.1%
Greg Kroah-Hartman            52   1.0%
Sujith                        52   1.0%
Dan Carpenter                 51   1.0%

By changed lines
Mauro Carvalho Chehab      28743   6.2%
Eliot Blennerhassett       18429   4.0%
Bob Beers                  11703   2.5%
Luis R. Rodriguez          10507   2.3%
Steve Wise                  9447   2.0%
Viresh Kumar                9426   2.0%
Jason Wessel                8739   1.9%
Sjur Braendeland            8685   1.9%
Stephen Rothwell            7908   1.7%
Matthias Benesch            7739   1.7%

Mauro Carvalho Chehab has had a busy development cycle; beyond large amounts of Video4Linux work, he's jumped into the Nehalem EDAC (memory error detection and correction) code and is adding a new core for the management of infrared controllers. Eric Paris has done a bunch of security cleanup work; he also has the fanotify subsystem queued up. Eliot Blennerhassett, instead, has a single patch: a driver for AudioScience sound devices.

It will be interesting to see how this list changes by the end of the 2.6.35 merge window. Even more interesting, arguably, will be the list of top non-author signoffs:

Most non-author signoffs (linux-next)

Mauro Carvalho Chehab        651  13.8%
John W. Linville             507  10.8%
David Miller                 462   9.8%
Greg Kroah-Hartman           411   8.7%
Ingo Molnar                  170   3.6%
Avi Kivity                   156   3.3%
James Bottomley              155   3.3%
Reinette Chatre               98   2.1%
David Woodhouse               93   2.0%
Marcelo Tosatti               72   1.5%

Subsystem maintainers are the folks who are charged with getting work into linux-next, so, if they all are doing their jobs, this list should not change much through the merge window.

If the numbers do hold, 2.6.35 looks like another relatively subdued development cycle without huge amounts of exciting new stuff. Things do tend to change during the merge window, though, and surprises always show up from somewhere. So, even with resources like linux-next, it's hard to tell what the next development cycle will truly bring.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds