Brief items
The current development kernel is 2.6.34-rc6, released on April 29. This
prepatch includes a lot of fixes, supplemented by the VMware balloon driver
(discussed briefly here in early April) and the ipeth driver which
facilitates USB tethering to iPhones. The short-form changelog is in the
announcement, or see the full changelog for the details.
Stable updates have been nonexistent over the last week.
If you were using two processes then I'd cheerily blame the
scheduler. Because blaming the scheduler for WeirdShitWhichBroke
is usually correct.
--
Andrew Morton
The Red Hat Enterprise Linux 6 kernel includes numerous subsystems and enhancements from 2.6.34, as well as its predecessor versions. As a result, the Red Hat Enterprise Linux 6 kernel cannot be simply labeled as any particular upstream version. Rather, the Red Hat Enterprise Linux 6 kernel is a hybrid of the latest several kernel versions. And, as Red Hat provides regular updates over the lifecycle of the product, we expect that the Red Hat Enterprise Linux 6 kernel will incorporate selected features from future upstream kernels that have yet to be developed.
--
Red Hat Enterprise Linux Team
My problem is I'm incredibly busy at the moment, and I've
already done Ubuntu a huge favor by spending ten minutes to do a
quickie investigation. Ubuntu needs to learn that it can't rely on
upstream developers to jump through flaming hoops on short notice
before a LTS release deadline as a cost-saving mechanism to avoid
hiring their own senior kernel engineers.
--
Ted Ts'o
Talk about high level designs rarely gets any traction, and often
goes nowhere. Give us an example implementation so there is
something concrete for us to sink our teeth into.
--
David Miller
By Jonathan Corbet
May 5, 2010
Back in February, the
checkpoint/restart patch set was brought to the kernel mailing list with a
request for inclusion in the -mm tree. That was immediately prior to the
2.6.34 merge window, so there were limited amounts of developer attention
available for review. At that time, Andrew Morton
suggested:
I'd suggest waiting until very shortly after 2.6.34-rc1 then please
send all the patches onto the list and let's get to work.
The checkpoint/restart developers did post the patches in March, to
relatively little response. Shortly before the
2.6.35 merge window, they reposted the whole thing as a 100-patch series.
Unsurprisingly, there have been some complaints about the massive mailing,
but there is another outcome which is less fortunate: the patches are not
being looked at.
That, too, is unsurprising. The amount of developer time available for
patch review is insufficient in the best of times, and it gets worse as the
merge window approaches. Even the most seasoned reviewer is going to be a
bit intimidated by a 100-patch series which pokes its fingers into almost
every part of the core kernel. Most of them will decide that they have
more important things to do elsewhere.
So, once again, checkpoint/restart is likely to be put on hold until after
the next merge window. After that, if it comes back in more manageable
pieces, the developers might truly get to work.
By Jonathan Corbet
May 5, 2010
Support for tracing in the Linux kernel has made great strides over the
last couple of years. One of the key features of a mature tracing system,
though, is a long list of well-defined, well-documented tracepoints which
allow a system administrator to hook into kernel events without
understanding the kernel code itself. The kernel has slowly been gaining
those tracepoints, but, as Steven Rostedt has
pointed out, there is a problem:
each tracepoint adds something between 1KB and 5KB to the size of the
kernel. When one starts to think about adding hundreds (or more)
tracepoints, that overhead starts to add up.
Steven, of course, is as good a person as any to blame for this problem, so
he has set out to fix it. His nine-part patch moves some information to
shared locations and eliminates unneeded stuff; the result was a 100KB
reduction in the size of his kernel. Needless to say, this seems like a
savings worth having; it makes it that much more likely that tracepoints
will actually be enabled in production kernels.
Of course, most of us will have to take Steven's word for it that the
patches make sense; they are written in that special dialect of C
preprocessor macros that mere kernel hackers fear to touch. So most of us
are likely to take the memory savings, but won't look too closely at how
they are achieved.
Kernel development news
By Jonathan Corbet
May 4, 2010
Dan Magenheimer's transcendent memory patch was examined here last July.
This patch creates a special
class of memory which is not directly accessible to the rest of the kernel,
allowing a number of special tricks to be played. Since then, transcendent
memory has seemingly disappeared from view - until now, at least. Dan has
returned with a pair of new abstractions - called "Cleancache" and
"Frontswap" - each of which encapsulates a part of what transcendent memory
does.
Cleancache is the less
controversial of the two. Dan describes it as
"a page-granularity victim cache for clean pages," which
should be crystal-clear to most LWN readers. For those who need a few more
words: Cleancache provides a place where the kernel can put pages which it
can afford to lose, but which it would like to keep around if possible. A
classic example is file-backed pages which are clean, so they can be
recovered from disk if need be. The kernel can drop such pages with no
data loss, but things will get slower if the page is needed in the near
future and must be read back from disk.
In such situations, the kernel could, instead of dropping the page, put it into the
Cleancache system with:
int cleancache_put_page(struct page *page);
At some future point, if there is a need for the page, it can be retrieved
with:
int cleancache_get_page(struct page *page);
The key point is that there is never any guarantee that
cleancache_get_page() will actually succeed in getting the page
back. The Cleancache code (or whatever mechanism sits behind it) is free
to drop the page at any time if it needs the memory for some other
purpose. So Cleancache users must be prepared to fall back to the real
backing store if cleancache_get_page() fails.
While Cleancache holds the page, it can do creative things with it. Pages
with duplicate contents are not uncommon, especially in virtualized
situations; often, significant numbers of pages contain only zeroes. The
backing store behind Cleancache can detect those duplicates and store a
single copy. Compression of stored pages is also possible; there is
currently work afoot to implement ramzswap (CompCache) as a Cleancache
backend. It might also be possible to use Cleancache as part of a
solid-state cache in front of a normal rotating drive.
Dan's patches include the addition of hooks to commonly-used filesystems so
that they will use Cleancache automatically.
The other half of the equation is Frontswap; unlike Cleancache, Frontswap is
meant to deal with dirty pages that the kernel would like to get rid of.
Once again, there is an interface for moving pages into and out of the
system:
int frontswap_put_page(struct page *page);
int frontswap_get_page(struct page *page);
The rules are a bit different, though: Frontswap is not required to accept
pages handed to it (so frontswap_put_page() can fail), but every
page it accepts is guaranteed to be there later when the kernel asks to get
it back.
Like Cleancache, Frontswap can play tricks with the stored pages to stretch its
memory resources. The real purpose behind this mechanism, though, appears
to be to enable a hypervisor to respond quickly to memory usage spikes in
virtualized guests. Dan put it this way:
Frontswap serves nicely as an emergency safety valve when a guest
has given up (too) much of its memory via ballooning but
unexpectedly has an urgent need that can't be serviced quickly
enough by the balloon driver.
Reviewers have been more skeptical of this mechanism. To some, it looks
like a way for dealing with shortcomings in the balloon driver, which is
already charged with implementing hypervisor decisions on how much memory
is to be made available to guests. If that is the case, it seems like
fixing the balloon driver might be the better approach.
Dan's response is that balloon drivers cannot respond quickly to memory
needs, and that regulating guest memory with a balloon driver can lead to
swap storms. This is, apparently, a real problem encountered by
virtualized systems in the field.
If, instead, the hypervisor maintains a pool of pages for Frontswap, it
can make them available quickly when the need arises, mitigating
memory-related performance problems.
Beyond that, Avi Kivity complains that
memory given to guests with Frontswap can never be recovered by the
hypervisor if those guests choose to hang onto it. Since operating systems
tend to be written to take advantage of all of the memory resources
available to them, it seems possible that Frontswap memory could fill
quickly and would stay full, leaving the hypervisor starving for memory
while maintaining pages it cannot get rid of. Avi also dislikes the
page-at-a-time, synchronous nature of the Frontswap API. Dan's response
here is that per-guest quotas will keep any guest from using too much
Frontswap space and that the API is better suited to the problem being
solved.
Complaints notwithstanding, Cleancache and Frontswap already appear to be
in reasonably wide use; they are shipping in OpenSUSE 11.2, Oracle's VM
virtualization product, and with Xen. Such distribution certainly
stretches the "upstream first" rule somewhat, but it also shows that there
is apparently a real use case for these features. Given that the patches
are not particularly intrusive and that the features have no cost if they
are not used, it seems that something along these lines should make it into
the mainline sooner or later.
By Jonathan Corbet
May 4, 2010
Aggressive power management is increasingly used to reduce the power
requirements of our systems. Sometimes, though, power management can,
through the creation of excessive latencies, get in the way of work which
needs to be done. One way to avoid problems is to have latency-sensitive
parts of the kernel express their requirements, which can then be taken
into account by the power management code. Tracking these requirements is
the task of the pm_qos ("power management quality of service") code.
Chances are that pm_qos will see a significant API change in 2.6.35.
The pm_qos code currently defines three quality of service parameters for
which requirements may be specified: CPU latency
(PM_QOS_CPU_DMA_LATENCY), network response latency
(PM_QOS_NETWORK_LATENCY), and network throughput
(PM_QOS_NETWORK_THROUGHPUT). The first two are specified in
microseconds; throughput is specified in KB/sec. Currently, CPU latency
requirements are observed by the cpuidle subsystem, and network
latency is observed only by the mac80211 layer. Any requests for a minimum
network throughput will fall on deaf ears in current kernels; given the
effectiveness of asking your editor's ISP for better service, one assumes
that the ignoring of throughput requests is simply a clever elimination of
useless work by the networking hackers.
The API for specifying quality of service parameters is:
#include <linux/pm_qos_params.h>
int pm_qos_add_requirement(int qos, char *name, s32 value);
int pm_qos_update_requirement(int qos, char *name, s32 value);
void pm_qos_remove_requirement(int qos, char *name);
For each of the above functions, qos is one of the parameters
listed above, name identifies the subsystem specifying the
requirement, and value is the new requirement. The name
string is used to identify a specific request in
pm_qos_update_requirement() and
pm_qos_remove_requirement(); it must match the value given when
the requirement was first added.
Kernel code which may make decisions affecting quality of service should
pay attention to the current requirements. There are two ways of doing
that, one of which is to just ask pm_qos what the tightest requirement
in effect is:
int pm_qos_requirement(int qos);
The alternative is to register a notifier which is called whenever a given
requirement changes, using:
int pm_qos_add_notifier(int qos, struct notifier_block *notifier);
int pm_qos_remove_notifier(int qos, struct notifier_block *notifier);
This API has been around for some time, though it remains lightly used
within the kernel. One complaint which has been made is that the use of
strings to identify requirements leads to inefficient behavior: changing a
requirement involves walking a list and doing a bunch of string
comparisons. Requirements are, by their nature, specified by
latency-sensitive code, so it makes sense that the process should be fast.
The use of arbitrary strings also opens up a distant possibility of
confusion should two developers accidentally choose the same name.
In response to these problems, pm_qos hacker Mark Gross has proposed some changes to the
API. With the new version, "requirements" would become "requests," and the
use of strings to identify them would be removed. The new API for the
specification of requests is:
struct pm_qos_request_list *pm_qos_add_request(int qos, s32 value);
void pm_qos_update_request(struct pm_qos_request_list *pm_qos_req,
                           s32 new_value);
void pm_qos_remove_request(struct pm_qos_request_list *pm_qos_req);
The pm_qos_request_list structure type is opaque to callers; it
serves only as a handle to identify a specific request. Changes and
removals can now be done with no list traversals and no string
comparisons.
On the other side, pm_qos_requirement() becomes
pm_qos_request(), but the API is otherwise unchanged.
This change seems uncontroversial, and it should address the criticisms
which have been made against this API. Unless something surprising
happens, the new API will probably be merged for 2.6.35.
By Jonathan Corbet
May 4, 2010
As of this writing, the current kernel prepatch is 2.6.34-rc6. A couple
more prepatches are most likely due before the final release, but the
number of changes to be found there should be small. In other words,
2.6.34 is close to its final form, so it makes sense to take a look at what
has gone into this development cycle. In a few ways, 2.6.34 is an unusual
kernel.
This kernel has seen the addition of 9100 non-merge changesets from just
over 1100 developers. That makes it somewhat smaller than its
predecessors, as can be seen in this table:
| Kernel | Patches | Devs |
| 2.6.29 | 11,600 | 1170 |
| 2.6.30 | 11,700 | 1130 |
| 2.6.31 | 10,600 | 1150 |
| 2.6.32 | 10,800 | 1230 |
| 2.6.33 | 10,500 | 1150 |
| 2.6.34 | 9,100 | 1110 |
Developer participation in this development cycle was slightly lower than
usual, but not in any significant way. But, it seems, those developers
had a bit less than usual that they needed to get done. One might be
tempted to chalk that up to the shorter-than-usual merge window at the
beginning of this cycle, but the fact of the matter is that Linus let
enough new material in after 2.6.34-rc1 to make the merge window
effectively as long as it ever was.
The lists of the most active developers suggest that perhaps something else
was going on: many of the developers who traditionally put large amounts of
code into the kernel essentially sat out this cycle.
| Most active 2.6.34 developers |
| By changesets |
| Sage Weil | 212 | 2.3% |
| Joe Perches | 169 | 1.9% |
| Paul Mundt | 153 | 1.7% |
| Uwe Kleine-König | 109 | 1.2% |
| Mark Brown | 102 | 1.1% |
| Ben Dooks | 96 | 1.1% |
| Rafał Miłecki | 88 | 1.0% |
| Dan Carpenter | 84 | 0.9% |
| Alex Deucher | 83 | 0.9% |
| H Hartley Sweeten | 80 | 0.9% |
| Christoph Hellwig | 75 | 0.8% |
| Johannes Berg | 74 | 0.8% |
| Arnaldo Carvalho de Melo | 72 | 0.8% |
| Bartlomiej Zolnierkiewicz | 64 | 0.7% |
| David S. Miller | 63 | 0.7% |
| Magnus Damm | 63 | 0.7% |
| By changed lines |
| Sage Weil | 30233 | 4.1% |
| Vladislav Zolotarov | 23119 | 3.2% |
| Jarod Wilson | 19689 | 2.7% |
| Mark Brown | 18513 | 2.5% |
| Dimitris Michailidis | 13919 | 1.9% |
| Manuel Lauss | 11831 | 1.6% |
| Jörn Engel | 10810 | 1.5% |
| Kukjin Kim | 10142 | 1.4% |
| Alex Deucher | 9785 | 1.3% |
| Amit Kumar Salecha | 9391 | 1.3% |
| Michael Chan | 9336 | 1.3% |
| Joe Perches | 8738 | 1.2% |
| Paul Mundt | 8438 | 1.2% |
| Haojian Zhuang | 8403 | 1.1% |
| Magnus Damm | 8320 | 1.1% |
| Matthias Benesch | 7739 | 1.1% |
Sage Weil jumped to the top of both lists with the merger of the Ceph distributed filesystem and
the subsequent bug-fixing activity. Joe Perches is the new king of the
trivial patch; his work includes lots of checkpatch fixups, reworking print
statements in network drivers, and no less than 37 patches implementing a
rather belated cleanup of the floppy driver. Paul Mundt's work falls
almost exclusively within his role as the maintainer of the Super-H
architecture. Uwe Kleine-König works mostly within the ARM
architecture code, and Mark Brown continues as the source of large amounts
of sound driver and embedded processor code.
On the "lines changed" side, Vladislav Zolotarov only contributed nine
patches, all with the Broadcom NetXtreme II driver - but they included a
large replacement of the in-tree firmware. Jarod Wilson's count was even
smaller - three patches; he contributed the Broadcom Crystal HD driver to
the staging tree. Dimitris Michailidis earned his place on the list with
the new Chelsio Communications T4 Ethernet driver.
Just over 180 employers were identified as having contributed to 2.6.34 -
almost exactly the same as 2.6.33. With the 2.6.33 summary, your editor
suggested that Red Hat's position as the top contributor may soon be
threatened; let's see how that prediction worked out for 2.6.34:
| Most active 2.6.34 employers |
| By changesets |
| (None) | 1455 | 16.0% |
| (Unknown) | 959 | 10.5% |
| Red Hat | 934 | 10.3% |
| Intel | 472 | 5.2% |
| IBM | 354 | 3.9% |
| Novell | 329 | 3.6% |
| (Consultant) | 274 | 3.0% |
| Nokia | 248 | 2.7% |
| New Dream Network | 237 | 2.6% |
| Renesas Technology | 188 | 2.1% |
| Texas Instruments | 180 | 2.0% |
| Pengutronix | 154 | 1.7% |
| Oracle | 144 | 1.6% |
| HP | 128 | 1.4% |
| (Academia) | 125 | 1.4% |
| Analog Devices | 123 | 1.4% |
| AMD | 121 | 1.3% |
| Fujitsu | 121 | 1.3% |
| Marvell | 120 | 1.3% |
| Wolfson Microelectronics | 101 | 1.1% |
| By lines changed |
| Red Hat | 75235 | 10.3% |
| (None) | 75160 | 10.3% |
| (Unknown) | 67541 | 9.2% |
| Broadcom | 56595 | 7.7% |
| Intel | 33175 | 4.5% |
| New Dream Network | 31501 | 4.3% |
| (Consultant) | 29140 | 4.0% |
| Novell | 24217 | 3.3% |
| Wolfson Microelectronics | 20660 | 2.8% |
| Renesas Technology | 16205 | 2.2% |
| Chelsio | 13937 | 1.9% |
| IBM | 13618 | 1.9% |
| QLogic | 13182 | 1.8% |
| MSC Vertriebs GmbH | 12545 | 1.7% |
| Samsung | 12224 | 1.7% |
| Marvell | 11914 | 1.6% |
| Texas Instruments | 11228 | 1.5% |
| Analog Devices | 11047 | 1.5% |
| AMD | 10894 | 1.5% |
| Nokia | 10217 | 1.4% |
Looking at absolute numbers, Red Hat's contributions declined considerably
from 2.6.33: 1223 changesets dropped to 934. Everybody else declined even
further, though; Intel's changeset count was less than half of its value
from 2.6.33. So Red Hat stays firmly at the top of the list. Many of the
other companies on the list will be unsurprising, but readers may be
forgiven for wondering about New Dream Network; that is a business
co-founded by Ceph developer Sage Weil.
If we look at non-author signoffs, we get a view of who the most active
gatekeepers for the kernel are. Here, there are no surprises at all:
| Most non-author signoffs |
| By developer |
| David S. Miller | 1034 | 13.0% |
| Greg Kroah-Hartman | 780 | 9.8% |
| Andrew Morton | 546 | 6.9% |
| John W. Linville | 546 | 6.9% |
| Ingo Molnar | 348 | 4.4% |
| Mauro Carvalho Chehab | 330 | 4.2% |
| James Bottomley | 244 | 3.1% |
| Dave Airlie | 150 | 1.9% |
| Ralf Baechle | 144 | 1.8% |
| H. Peter Anvin | 141 | 1.8% |
| By employer |
| Red Hat | 2865 | 36.1% |
| Novell | 1293 | 16.3% |
| Intel | 565 | 7.1% |
| Google | 547 | 6.9% |
| (None) | 365 | 4.6% |
| IBM | 289 | 3.6% |
| (Consultant) | 194 | 2.4% |
| Wind River | 145 | 1.8% |
| Atomide | 130 | 1.6% |
| Oracle | 128 | 1.6% |
Ten development cycles ago
(2.6.24), Andrew Morton was the most active gatekeeper, signing off on
almost 1700 patches. His role as subsystem maintainer of last resort has
declined over the years as more maintainers manage their own repositories
and push patches directly to Linus. Speaking of Linus, he not only didn't
make the list above, but he wasn't even close: his 71 signoffs put him in
the 22nd position. Dave Airlie's position on the list is an indication of
how much activity we are currently seeing in the graphics area.
Once again, over 50% of the patches heading into the mainline kernel pass
through the hands of somebody employed by either Red Hat or Novell.
Looking forward
As of this writing, the opening of the 2.6.35 merge window can be expected
sometime in the next 1-3 weeks. By the stated rules of the kernel
development process, the bulk of the code intended for that merge window
should already be in the linux-next tree. With that in mind, your editor
pulled down the May 4 edition of linux-next to see what was up. There
are currently 5144 non-merge changesets in that tree, representing 758
developers. The top contributors are:
| Most active linux-next developers |
| By changesets |
| Mauro Carvalho Chehab | 245 | 4.8% |
| Eric Paris | 103 | 2.0% |
| Alexander Graf | 84 | 1.6% |
| Johannes Berg | 59 | 1.1% |
| Juuso Oikarinen | 59 | 1.1% |
| Jean-François Moine | 58 | 1.1% |
| Luis R. Rodriguez | 58 | 1.1% |
| Greg Kroah-Hartman | 52 | 1.0% |
| Sujith | 52 | 1.0% |
| Dan Carpenter | 51 | 1.0% |
| By changed lines |
| Mauro Carvalho Chehab | 28743 | 6.2% |
| Eliot Blennerhassett | 18429 | 4.0% |
| Bob Beers | 11703 | 2.5% |
| Luis R. Rodriguez | 10507 | 2.3% |
| Steve Wise | 9447 | 2.0% |
| Viresh Kumar | 9426 | 2.0% |
| Jason Wessel | 8739 | 1.9% |
| Sjur Braendeland | 8685 | 1.9% |
| Stephen Rothwell | 7908 | 1.7% |
| Matthias Benesch | 7739 | 1.7% |
Mauro Carvalho Chehab has had a busy development cycle; beyond large
amounts of Video4Linux work, he's jumped into the Nehalem EDAC (memory
error detection and correction) code and is
adding a new core for the management of infrared controllers. Eric Paris
has done a bunch of security cleanup work; he also has the fanotify subsystem queued up.
Eliot Blennerhassett, instead, has a single patch: a driver for
AudioScience sound devices.
It will be interesting to see how this list changes by the end of the
2.6.35 merge window. Even more interesting, arguably, will be the list of
top non-author signoffs:
| Most non-author signoffs (linux-next) |
| Mauro Carvalho Chehab | 651 | 13.8% |
| John W. Linville | 507 | 10.8% |
| David Miller | 462 | 9.8% |
| Greg Kroah-Hartman | 411 | 8.7% |
| Ingo Molnar | 170 | 3.6% |
| Avi Kivity | 156 | 3.3% |
| James Bottomley | 155 | 3.3% |
| Reinette Chatre | 98 | 2.1% |
| David Woodhouse | 93 | 2.0% |
| Marcelo Tosatti | 72 | 1.5% |
Subsystem maintainers are the folks who are charged with getting work into
linux-next, so, if they all are doing their jobs, this list should not
change much through the merge window.
If the numbers do hold, 2.6.35 looks like another relatively subdued
development cycle without huge amounts of exciting new stuff. Things do
tend to change during the merge window, though, and surprises always show
up from somewhere. So, even with resources like linux-next, it's hard to
tell what the next development cycle will truly bring.
Page editor: Jonathan Corbet