Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.34-rc6, released on April 29. This prepatch includes a lot of fixes, supplemented by the VMware balloon driver (discussed briefly here in early April) and the ipheth driver, which facilitates USB tethering to iPhones. The short-form changelog is in the announcement, or see the full changelog for the details.
Stable updates have been nonexistent over the last week.
Quotes of the week
A rough restart for checkpoints
Back in February, the checkpoint/restart patch set was brought to the kernel mailing list with a request for inclusion in the -mm tree. That was immediately prior to the 2.6.34 merge window, so there were limited amounts of developer attention available for review. At that time, Andrew Morton suggested:
The checkpoint/restart developers did post the patches in March, to relatively little response. Shortly before the 2.6.35 merge window, they reposted the whole thing as a 100-patch series. Unsurprisingly, there have been some complaints about the massive mailing, but there is another outcome which is less fortunate: the patches are not being looked at.
That, too, is unsurprising. The amount of developer time available for patch review is insufficient in the best of times, and it gets worse as the merge window approaches. Even the most seasoned reviewer is going to be a bit intimidated by a 100-patch series which pokes its fingers into almost every part of the core kernel. Most of them will decide that they have more important things to do elsewhere.
So, once again, checkpoint/restart is likely to be put on hold until after the next merge window. After that, if it comes back in more manageable pieces, the developers might truly get to work.
De-bloating tracepoints
Support for tracing in the Linux kernel has made great strides over the last couple of years. One of the key features of a mature tracing system, though, is a long list of well-defined, well-documented tracepoints which allow a system administrator to hook into kernel events without understanding the kernel code itself. The kernel has slowly been gaining those tracepoints, but, as Steven Rostedt has pointed out, there is a problem: each tracepoint adds something between 1KB and 5KB to the size of the kernel. When one starts to think about adding hundreds (or more) tracepoints, that overhead starts to add up.
Steven, of course, is as good a person as any to blame for this problem, so he has set out to fix it. His nine-part patch moves some information to shared locations and eliminates unneeded stuff; the result was a 100KB reduction in the size of his kernel. Needless to say, this seems like a savings worth having; it makes it that much more likely that tracepoints will actually be enabled in production kernels.
Of course, most of us will have to take Steven's word for it that the patches make sense; they are written in that special dialect of C preprocessor macros that mere kernel hackers fear to touch. So most of us are likely to take the memory savings, but won't look too closely at how they are achieved.
Kernel development news
Cleancache and Frontswap
Dan Magenheimer's transcendent memory patch was examined here last July. This patch creates a special class of memory which is not directly accessible to the rest of the kernel, allowing a number of special tricks to be played. Since then, transcendent memory has seemingly disappeared from view - until now, at least. Dan has returned with a pair of new abstractions - called "Cleancache" and "Frontswap" - each of which encapsulates a part of what transcendent memory does.
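Cleancache is the less controversial of the two. Dan describes it as
"a page-granularity victim cache for clean pages": a place where the
kernel can put pages which it can afford to lose, but which it would like
to keep around if possible, such as clean file-backed pages which can be
recovered from disk.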
In such situations, the kernel could, instead of dropping the page, put it into the
Cleancache system with cleancache_put_page().
At some future point, if there is a need for the page, it can be retrieved
with cleancache_get_page().
The key point is that there is never any guarantee that
cleancache_get_page() will actually succeed in getting the page
back. The Cleancache code (or whatever mechanism sits behind it) is free
to drop the page at any time if it needs the memory for some other
purpose. So Cleancache users must be prepared to fall back to the real
backing store if cleancache_get_page() fails.
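The "be prepared to fall back" rule can be illustrated with a small user-space sketch. Nothing below is the actual Cleancache code: the toy_ names, the slot-based cache, and the fake backing store are all assumptions made up for illustration. The only point is the shape of the caller, which tries the cache first and goes to the backing store on a miss.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096
#define TOY_CACHE_SLOTS 4

/* Hypothetical stand-in for struct page: an index plus the page contents. */
struct toy_page {
    unsigned long index;
    unsigned char data[TOY_PAGE_SIZE];
};

/* Toy backend: a handful of slots which may be reclaimed at any time. */
struct toy_slot {
    bool used;
    unsigned long index;
    unsigned char data[TOY_PAGE_SIZE];
};
static struct toy_slot toy_cache[TOY_CACHE_SLOTS];

/* Offer a clean page to the cache; whatever occupied the slot is lost. */
void toy_cleancache_put(const struct toy_page *page)
{
    struct toy_slot *s = &toy_cache[page->index % TOY_CACHE_SLOTS];

    s->used = true;
    s->index = page->index;
    memcpy(s->data, page->data, TOY_PAGE_SIZE);
}

/* Try to get a page back: 0 on a hit, -1 if the cache no longer has it. */
int toy_cleancache_get(struct toy_page *page)
{
    struct toy_slot *s = &toy_cache[page->index % TOY_CACHE_SLOTS];

    if (!s->used || s->index != page->index)
        return -1;
    memcpy(page->data, s->data, TOY_PAGE_SIZE);
    return 0;
}

/* The backend is free to drop everything when it wants its memory back. */
void toy_cleancache_shrink(void)
{
    memset(toy_cache, 0, sizeof(toy_cache));
}

/* Stand-in for the real backing store: contents derived from the index. */
void toy_read_from_disk(struct toy_page *page)
{
    memset(page->data, (int)(page->index & 0xff), TOY_PAGE_SIZE);
}

/* Fill the page; returns true on a cache hit, false if the disk was read. */
bool toy_read_page(struct toy_page *page)
{
    if (toy_cleancache_get(page) == 0)
        return true;
    toy_read_from_disk(page);   /* the fallback which must always exist */
    return false;
}
```

A caller of toy_read_page() gets the same data either way; the return value only says whether the slow path was needed.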
While Cleancache holds the page, it can do creative things with it. Pages
with duplicate contents are not uncommon, especially in virtualized
situations; often, significant numbers of pages contain only zeroes. The
backing store behind Cleancache can detect those duplicates and store a
single copy. Compression of stored pages is also possible; there is
currently work afoot to implement ramzswap (CompCache) as a Cleancache
backend. It might also be possible to use Cleancache as part of a
solid-state cache in front of a normal rotating drive.
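As a rough illustration of the simplest of those tricks, here is a user-space sketch of a store which recognizes all-zero pages and keeps only a marker for them. It is not taken from any Cleancache backend; the toy_ names and the structure layout are assumptions for the sake of the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* A stored page: either a marker meaning "all zeroes" or a private copy. */
struct toy_stored {
    bool is_zero;
    unsigned char *copy;    /* NULL when is_zero is true */
};

/* Return true if every byte of the page is zero. */
bool toy_page_is_zero(const unsigned char *page)
{
    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/* Store a page; returns the payload bytes used, or (size_t)-1 on failure. */
size_t toy_store(struct toy_stored *s, const unsigned char *page)
{
    s->is_zero = toy_page_is_zero(page);
    s->copy = NULL;
    if (s->is_zero)
        return 0;           /* nothing to keep beyond the marker */
    s->copy = malloc(TOY_PAGE_SIZE);
    if (!s->copy)
        return (size_t)-1;
    memcpy(s->copy, page, TOY_PAGE_SIZE);
    return TOY_PAGE_SIZE;
}

/* Rebuild the page contents from the stored form. */
void toy_load(const struct toy_stored *s, unsigned char *page)
{
    if (s->is_zero)
        memset(page, 0, TOY_PAGE_SIZE);
    else
        memcpy(page, s->copy, TOY_PAGE_SIZE);
}
```

General duplicate detection or compression would follow the same pattern, with a content hash or a compressed buffer taking the place of the is_zero flag.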
Dan's patches include the addition of hooks to commonly-used filesystems so
that they will use Cleancache automatically.
The other half of the equation is Frontswap; unlike Cleancache, Frontswap is
meant to deal with dirty pages that the kernel would like to get rid of.
Once again, there is an interface for moving pages into and out of the
system: frontswap_put_page() and frontswap_get_page().
The rules are a bit different, though: Frontswap is not required to accept
pages handed to it (so frontswap_put_page() can fail), but every
page it accepts is guaranteed to be there later when the kernel asks to get
it back.
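The difference in guarantees can be sketched in user space as follows. This is a toy model, not the Frontswap implementation: the toy_ names, the offset-keyed interface, the two-page pool, and the disk-write counter are all assumptions. A put may be refused when the pool is full, but anything which was accepted stays put until the caller asks for it.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE  4096
#define TOY_POOL_PAGES 2    /* assumed quota, kept tiny on purpose */

struct toy_entry {
    bool used;
    unsigned long offset;
    unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_entry toy_pool[TOY_POOL_PAGES];
int toy_disk_writes;        /* counts fallbacks to the "real" swap device */

static struct toy_entry *toy_find(unsigned long offset)
{
    for (int i = 0; i < TOY_POOL_PAGES; i++)
        if (toy_pool[i].used && toy_pool[i].offset == offset)
            return &toy_pool[i];
    return NULL;
}

/* Returns 0 if the page was accepted, -1 if the pool refused it. */
int toy_frontswap_put(unsigned long offset, const unsigned char *data)
{
    struct toy_entry *e = toy_find(offset);    /* reuse an existing entry */

    for (int i = 0; !e && i < TOY_POOL_PAGES; i++)
        if (!toy_pool[i].used)
            e = &toy_pool[i];
    if (!e)
        return -1;          /* full: refusing is allowed */
    e->used = true;
    e->offset = offset;
    memcpy(e->data, data, TOY_PAGE_SIZE);
    return 0;
}

/* Returns 0 for any page which was accepted; -1 only if it never was. */
int toy_frontswap_get(unsigned long offset, unsigned char *data)
{
    struct toy_entry *e = toy_find(offset);

    if (!e)
        return -1;
    memcpy(data, e->data, TOY_PAGE_SIZE);
    return 0;
}

/* Swap-out path: try the pool first, fall back to the swap device. */
bool toy_swap_out(unsigned long offset, const unsigned char *data)
{
    if (toy_frontswap_put(offset, data) == 0)
        return true;
    toy_disk_writes++;      /* stand-in for a real write to swap */
    return false;
}
```

Note that nothing in this model ever evicts an accepted page behind the caller's back; that is the property which separates it from the Cleancache case.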
Like Cleancache, Frontswap can play tricks with the stored pages to stretch its
memory resources. The real purpose behind this mechanism, though, appears
to be to enable a hypervisor to respond quickly to memory usage spikes in
virtualized guests. Dan put it this way:
Reviewers have been more skeptical of this mechanism. To some, it looks
like a way for dealing with shortcomings in the balloon driver, which is
already charged with implementing hypervisor decisions on how much memory
is to be made available to guests. If that is the case, it seems like
fixing the balloon driver might be the better approach.
Dan's response is that balloon drivers cannot respond quickly to memory
needs, and that regulating guest memory with a balloon driver can lead to
swap storms. This is, apparently, a real problem encountered by
virtualized systems in the field.
If, instead, the hypervisor maintains a pool of pages for Frontswap, it
can make them available quickly when the need arises, mitigating
memory-related performance problems.
Beyond that, Avi Kivity complains that
memory given to guests with Frontswap can never be recovered by the
hypervisor if those guests choose to hang onto it. Since operating systems
tend to be written to take advantage of all of the memory resources
available to them, it seems possible that Frontswap memory could fill
quickly and would stay full, leaving the hypervisor starving for memory
while maintaining pages it cannot get rid of. Avi also dislikes the
page-at-a-time, synchronous nature of the Frontswap API. Dan's response
here is that per-guest quotas will keep any guest from using too much
Frontswap space and that the API is better suited to the problem being
solved.
Complaints notwithstanding, Cleancache and Frontswap already appear to be
in reasonably wide use; they are shipping in OpenSUSE 11.2, Oracle's VM
virtualization product, and with Xen. Such distribution certainly
stretches the "upstream first" rule somewhat, but it also shows that there
is apparently a real use case for these features. Given that the patches
are not particularly intrusive and that the features have no cost if they
are not used, it seems that something along these lines should make it into
the mainline sooner or later.
The pm_qos code currently defines three quality of service parameters for
which requirements may be specified: CPU latency
(PM_QOS_CPU_DMA_LATENCY), network response latency
(PM_QOS_NETWORK_LATENCY), and network throughput
(PM_QOS_NETWORK_THROUGHPUT). The first two are specified in
microseconds; throughput is specified in KB/sec. Currently, CPU latency
requirements are observed by the cpuidle subsystem, and network
latency is observed only by the mac80211 layer. Any requests for a minimum
network throughput will fall on deaf ears in current kernels; given the
effectiveness of asking your editor's ISP for better service, one assumes
that the ignoring of throughput requests is simply a clever elimination of
useless work by the networking hackers.
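The API for specifying quality of service parameters, declared in
<linux/pm_qos_params.h>, consists of pm_qos_add_requirement(),
pm_qos_update_requirement(), and pm_qos_remove_requirement().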
For each of the above functions, qos is one of the parameters
listed above, name identifies the subsystem specifying the
requirement, and value is the new requirement. The name
string is used to identify a specific request in
pm_qos_update_requirement() and
pm_qos_remove_requirement(); it must match the value given when
the requirement was first added.
Kernel code which may make decisions affecting quality of service should
pay attention to the current requirements. There are two ways of doing
that. The first is to just ask pm_qos for the tightest requirement
currently in effect by calling pm_qos_requirement().
The alternative is to register a notifier which is called whenever a given
requirement changes, using pm_qos_add_notifier();
pm_qos_remove_notifier() undoes the registration.
This API has been around for some time, though it remains lightly used
within the kernel. One complaint which has been made is that the use of
strings to identify requirements leads to inefficient behavior: changing a
requirement involves walking a list and doing a bunch of string
comparisons. Requirements are, by their nature, specified by
latency-sensitive code, so it makes sense that the process should be fast.
The use of arbitrary strings also opens up a distant possibility of
confusion should two developers accidentally choose the same name.
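To make the cost concrete, here is a user-space toy model of string-keyed requirements for a single latency-style parameter. It is not the kernel's pm_qos code; the toy_ names, the fixed-size table, the "no constraint" default, and the decision to reject duplicate names are all assumptions. What it shares with the complaint above is that every update or removal must search by name with strcmp().

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX_REQS        8
#define TOY_DEFAULT_LATENCY INT32_MAX   /* assumed "no constraint" value */

struct toy_requirement {
    int used;
    const char *name;   /* caller-owned string, must outlive the entry */
    int32_t value;      /* latency bound; smaller means tighter */
};

static struct toy_requirement toy_reqs[TOY_MAX_REQS];

/* The expensive part: a walk with one string comparison per entry. */
static struct toy_requirement *toy_lookup(const char *name)
{
    for (int i = 0; i < TOY_MAX_REQS; i++)
        if (toy_reqs[i].used && strcmp(toy_reqs[i].name, name) == 0)
            return &toy_reqs[i];
    return NULL;
}

/* Returns 0 on success, -1 if the name is taken or the table is full. */
int toy_add_requirement(const char *name, int32_t value)
{
    if (toy_lookup(name))
        return -1;      /* toy choice: refuse colliding names */
    for (int i = 0; i < TOY_MAX_REQS; i++) {
        if (!toy_reqs[i].used) {
            toy_reqs[i].used = 1;
            toy_reqs[i].name = name;
            toy_reqs[i].value = value;
            return 0;
        }
    }
    return -1;
}

/* Returns 0 on success, -1 if no requirement carries that name. */
int toy_update_requirement(const char *name, int32_t value)
{
    struct toy_requirement *r = toy_lookup(name);

    if (!r)
        return -1;
    r->value = value;
    return 0;
}

void toy_remove_requirement(const char *name)
{
    struct toy_requirement *r = toy_lookup(name);

    if (r)
        r->used = 0;
}

/* The tightest (smallest) latency requirement currently in effect. */
int32_t toy_requirement(void)
{
    int32_t tightest = TOY_DEFAULT_LATENCY;

    for (int i = 0; i < TOY_MAX_REQS; i++)
        if (toy_reqs[i].used && toy_reqs[i].value < tightest)
            tightest = toy_reqs[i].value;
    return tightest;
}
```

With two subsystems registered as "audio" (100) and "wifi" (50), toy_requirement() yields 50; each later update pays for a fresh name lookup.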
In response to these problems, pm_qos hacker Mark Gross has proposed some changes to the
API. With the new version, "requirements" would become "requests," and the
use of strings to identify them would be removed. The new API for the
specification of requests consists of pm_qos_add_request(),
pm_qos_update_request(), and pm_qos_remove_request().
The pm_qos_request_list structure type is opaque to callers; it
serves only as a handle to identify a specific request. Changes and
removals can now be done with no list traversals and no string
comparisons.
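The handle-based approach can be modeled in user space along these lines. Again, this is only a sketch under assumptions: the toy_ names, the single latency-style parameter, the malloc()-based allocation, and the doubly-linked list are inventions for illustration and say nothing about how the proposed patch is written. The point is that an update or removal touches only the object behind the handle.

```c
#include <stdint.h>
#include <stdlib.h>

/* Opaque to callers in spirit: they only ever hold a pointer to it. */
struct toy_request {
    int32_t value;
    struct toy_request *prev, *next;
};

/* Circular list head; an empty list points at itself. */
static struct toy_request toy_head = { 0, &toy_head, &toy_head };

/* Create a request and return its handle, or NULL if out of memory. */
struct toy_request *toy_add_request(int32_t value)
{
    struct toy_request *r = malloc(sizeof(*r));

    if (!r)
        return NULL;
    r->value = value;
    r->prev = &toy_head;
    r->next = toy_head.next;
    toy_head.next->prev = r;
    toy_head.next = r;
    return r;
}

/* No search and no strcmp(): the handle is the request. */
void toy_update_request(struct toy_request *r, int32_t new_value)
{
    r->value = new_value;
}

/* Unlinking needs only the neighbors recorded in the request itself. */
void toy_remove_request(struct toy_request *r)
{
    r->prev->next = r->next;
    r->next->prev = r->prev;
    free(r);
}

/* Tightest (smallest) value in effect; INT32_MAX if there are none. */
int32_t toy_request(void)
{
    int32_t tightest = INT32_MAX;

    for (struct toy_request *r = toy_head.next; r != &toy_head; r = r->next)
        if (r->value < tightest)
            tightest = r->value;
    return tightest;
}
```

The query in this toy still walks the list; a real implementation could cache the current extreme value, but that is beyond what this sketch tries to show.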
On the other side, pm_qos_requirement() becomes
pm_qos_request(), but the API is otherwise unchanged.
This change seems uncontroversial, and it should address the criticisms
which have been made against this API. Unless something surprising
happens, the new API will probably be merged for 2.6.35.
This kernel has seen the addition of 9100 non-merge changesets from just
over 1100 developers. That makes it somewhat smaller than its
predecessors, as can be seen in this table:
Developer participation in this development cycle was slightly lower than
the usual, but not in any significant way. But, it seems, those developers
had a bit less than usual that they needed to get done. One might be
tempted to chalk that up to the shorter-than-usual merge window at the
beginning of this cycle, but the fact of the matter is that Linus let
enough new material in after 2.6.34-rc1 to make the merge window
effectively as long as it ever was.
The lists of the most active developers suggest that perhaps something else
was going on: many of the developers who traditionally put large amounts of
code into the kernel essentially sat out this cycle.
Sage Weil jumped to the top of both lists with the merger of the Ceph distributed filesystem and
the subsequent bug-fixing activity. Joe Perches is the new king of the
trivial patch; his work includes lots of checkpatch fixups, reworking print
statements in network drivers, and no less than 37 patches implementing a
rather belated cleanup of the floppy driver. Paul Mundt's work falls
almost exclusively within his role as the maintainer of the Super-H
architecture. Uwe Kleine-König works mostly within the ARM
architecture code, and Mark Brown continues as the source of large amounts
of sound driver and embedded processor code.
On the "lines changed" side, Vladislav Zolotarov only contributed nine
patches, all with the Broadcom NetXtreme II driver - but they included a
large replacement of the in-tree firmware. Jarod Wilson's count was even
smaller - three patches; he contributed the Broadcom Crystal HD driver to
the staging tree. Dimitris Michailidis earned his place on the list with
the new Chelsio Communications T4 Ethernet driver.
Just over 180 employers were identified as having contributed to 2.6.34 -
almost exactly the same as 2.6.33. With the 2.6.33 summary, your editor
suggested that Red Hat's position as the top contributor may soon be
threatened; let's see how that prediction worked out for 2.6.34:
Looking at absolute numbers, Red Hat's contributions declined considerably
from 2.6.33: 1223 changesets dropped to 934. Everybody else declined even
further, though; Intel's changeset count was less than half of its value
from 2.6.33. So Red Hat stays firmly at the top of the list. Many of the
other companies on the list will be unsurprising, but readers may be
forgiven for wondering about New Dream Network; that is a business
co-founded by Ceph developer Sage Weil.
If we look at non-author signoffs, we get a view of who the most active
gatekeepers for the kernel are. Here, there are no surprises at all:
Ten development cycles ago
(2.6.24), Andrew Morton was the most active gatekeeper, signing off on
almost 1700 patches. His role as subsystem maintainer of last resort has
declined over the years as more maintainers manage their own repositories
and push patches directly to Linus. Speaking of Linus, he not only didn't
make the list above, but he wasn't even close: his 71 signoffs put him in
the 22nd position. Dave Airlie's position on the list is an indication of
how much activity we are currently seeing in the graphics area.
Once again, over 50% of the patches heading into the mainline kernel pass
through the hands of somebody employed by either Red Hat or Novell.
As of this writing, the opening of the 2.6.35 merge window can be expected
sometime in the next 1-3 weeks. By the stated rules of the kernel
development process, the bulk of the code intended for that merge window
should already be in the linux-next tree. With that in mind, your editor
pulled down the May 4 edition of linux-next to see what was up. There
are currently 5144 non-merge changesets in that tree, representing 758
developers. The top contributors are:
Mauro Carvalho Chehab has had a busy development cycle; beyond large
amounts of Video4Linux work, he's jumped into the Nehalem EDAC (memory
error detection and correction) code and is
adding a new core for the management of infrared controllers. Eric Paris
has done a bunch of security cleanup work; he also has the fanotify subsystem queued up.
Eliot Blennerhassett, instead, has a single patch: a driver for
AudioScience sound devices.
It will be interesting to see how this list changes by the end of the
2.6.35 merge window. Even more interesting, arguably, will be the list of
top non-author signoffs:
Subsystem maintainers are the folks who are charged with getting work into
linux-next, so, if they all are doing their jobs, this list should not
change much through the merge window.
If the numbers do hold, 2.6.35 looks like another relatively subdued
development cycle without huge amounts of exciting new stuff. Things do
tend to change during the merge window, though, and surprises always show
up from somewhere. So, even with resources like linux-next, it's hard to
tell what the next development cycle will truly bring.
a page-granularity victim cache for clean pages
", which
should be crystal-clear to most LWN readers. For those who need a few more
words: Cleancache provides a place where the kernel can put pages which it
can afford to lose, but which it would like to keep around if possible. A
classic example is file-backed pages which are clean, so they can be
recovered from disk if need be. The kernel can drop such pages with no
data loss, but things will get slower if the page is needed in the near
future and must be read back from disk.
int cleancache_put_page(struct page *page);
int cleancache_get_page(struct page *page);
int frontswap_put_page(struct page *page);
int frontswap_get_page(struct page *page);
Reworking pm_qos
Aggressive power management is increasingly used to reduce the power
requirements of our systems. Sometimes, though, power management can,
through the creation of excessive latencies, get in the way of work which
needs to be done. One way to avoid problems is to have latency-sensitive
parts of the kernel express their requirements, which can then be taken
into account by the power management code. Tracking these requirements is
the task of the pm_qos ("power management quality of service") code.
Chances are that pm_qos will see a significant API change in 2.6.35.
#include <linux/pm_qos_params.h>
int pm_qos_add_requirement(int qos, char *name, s32 value);
int pm_qos_update_requirement(int qos, char *name, s32 value);
void pm_qos_remove_requirement(int qos, char *name);
int pm_qos_requirement(int qos);
int pm_qos_add_notifier(int qos, struct notifier_block *notifier);
int pm_qos_remove_notifier(int qos, struct notifier_block *notifier);
The prototypes for the new request API are:
struct pm_qos_request_list *pm_qos_add_request(int qos, s32 value);
void pm_qos_update_request(struct pm_qos_request_list *pm_qos_req,
                           s32 new_value);
void pm_qos_remove_request(struct pm_qos_request_list *pm_qos_req);
Kernel development statistics for 2.6.34 and beyond
As of this writing, the current kernel prepatch is 2.6.34-rc6. A couple
more prepatches are most likely due before the final release, but the
number of changes to be found there should be small. In other words,
2.6.34 is close to its final form, so it makes sense to take a look at what
has gone into this development cycle. In a few ways, 2.6.34 is an unusual
kernel.
Kernel    Patches    Devs
2.6.29    11,600     1170
2.6.30    11,700     1130
2.6.31    10,600     1150
2.6.32    10,800     1230
2.6.33    10,500     1150
2.6.34     9,100     1110
Most active 2.6.34 developers
By changesets
Sage Weil 212 2.3%
Joe Perches 169 1.9%
Paul Mundt 153 1.7%
Uwe Kleine-König 109 1.2%
Mark Brown 102 1.1%
Ben Dooks 96 1.1%
Rafał Miłecki 88 1.0%
Dan Carpenter 84 0.9%
Alex Deucher 83 0.9%
H Hartley Sweeten 80 0.9%
Christoph Hellwig 75 0.8%
Johannes Berg 74 0.8%
Arnaldo Carvalho de Melo 72 0.8%
Bartlomiej Zolnierkiewicz 64 0.7%
David S. Miller 63 0.7%
Magnus Damm 63 0.7%
By changed lines
Sage Weil 30233 4.1%
Vladislav Zolotarov 23119 3.2%
Jarod Wilson 19689 2.7%
Mark Brown 18513 2.5%
Dimitris Michailidis 13919 1.9%
Manuel Lauss 11831 1.6%
Jörn Engel 10810 1.5%
Kukjin Kim 10142 1.4%
Alex Deucher 9785 1.3%
Amit Kumar Salecha 9391 1.3%
Michael Chan 9336 1.3%
Joe Perches 8738 1.2%
Paul Mundt 8438 1.2%
Haojian Zhuang 8403 1.1%
Magnus Damm 8320 1.1%
Matthias Benesch 7739 1.1%
Most active 2.6.34 employers
By changesets
(None) 1455 16.0%
(Unknown) 959 10.5%
Red Hat 934 10.3%
Intel 472 5.2%
IBM 354 3.9%
Novell 329 3.6%
(Consultant) 274 3.0%
Nokia 248 2.7%
New Dream Network 237 2.6%
Renesas Technology 188 2.1%
Texas Instruments 180 2.0%
Pengutronix 154 1.7%
Oracle 144 1.6%
HP 128 1.4%
(Academia) 125 1.4%
Analog Devices 123 1.4%
AMD 121 1.3%
Fujitsu 121 1.3%
Marvell 120 1.3%
Wolfson Microelectronics 101 1.1%
By lines changed
Red Hat 75235 10.3%
(None) 75160 10.3%
(Unknown) 67541 9.2%
Broadcom 56595 7.7%
Intel 33175 4.5%
New Dream Network 31501 4.3%
(Consultant) 29140 4.0%
Novell 24217 3.3%
Wolfson Microelectronics 20660 2.8%
Renesas Technology 16205 2.2%
Chelsio 13937 1.9%
IBM 13618 1.9%
QLogic 13182 1.8%
MSC Vertriebs GmbH 12545 1.7%
Samsung 12224 1.7%
Marvell 11914 1.6%
Texas Instruments 11228 1.5%
Analog Devices 11047 1.5%
AMD 10894 1.5%
Nokia 10217 1.4%
Most non-author signoffs
By developer
David S. Miller 1034 13.0%
Greg Kroah-Hartman 780 9.8%
Andrew Morton 546 6.9%
John W. Linville 546 6.9%
Ingo Molnar 348 4.4%
Mauro Carvalho Chehab 330 4.2%
James Bottomley 244 3.1%
Dave Airlie 150 1.9%
Ralf Baechle 144 1.8%
H. Peter Anvin 141 1.8%
By employer
Red Hat 2865 36.1%
Novell 1293 16.3%
Intel 565 7.1%
Google 547 6.9%
(None) 365 4.6%
IBM 289 3.6%
(Consultant) 194 2.4%
Wind River 145 1.8%
Atomide 130 1.6%
Oracle 128 1.6%
Looking forward
Most active linux-next developers
By changesets
Mauro Carvalho Chehab 245 4.8%
Eric Paris 103 2.0%
Alexander Graf 84 1.6%
Johannes Berg 59 1.1%
Juuso Oikarinen 59 1.1%
Jean-François Moine 58 1.1%
Luis R. Rodriguez 58 1.1%
Greg Kroah-Hartman 52 1.0%
Sujith 52 1.0%
Dan Carpenter 51 1.0%
By changed lines
Mauro Carvalho Chehab 28743 6.2%
Eliot Blennerhassett 18429 4.0%
Bob Beers 11703 2.5%
Luis R. Rodriguez 10507 2.3%
Steve Wise 9447 2.0%
Viresh Kumar 9426 2.0%
Jason Wessel 8739 1.9%
Sjur Braendeland 8685 1.9%
Stephen Rothwell 7908 1.7%
Matthias Benesch 7739 1.7%
Most non-author signoffs (linux-next)
Mauro Carvalho Chehab 651 13.8%
John W. Linville 507 10.8%
David Miller 462 9.8%
Greg Kroah-Hartman 411 8.7%
Ingo Molnar 170 3.6%
Avi Kivity 156 3.3%
James Bottomley 155 3.3%
Reinette Chatre 98 2.1%
David Woodhouse 93 2.0%
Marcelo Tosatti 72 1.5%
Page editor: Jonathan Corbet
