Brief items
The current 2.6 prepatch remains 2.6.22-rc4. Patches continue to
flow into the mainline repository; they are mostly fixes, but the
ZERO_SIZE_PTR patch for the SLUB
allocator has also gone in.
The current -mm tree is 2.6.22-rc4-mm2. Recent changes
to -mm are almost all fixes aimed at stabilizing this tree somewhat.
The current stable 2.6 kernel is 2.6.21.5, released on June 11 with a
rather long list of fixes. 2.6.21.4 was released on
June 8 with a set of security fixes: "The /dev/[u]random fix is especially important for machines with no
entropy source (e.g. keyboard, mice, or disk drives) and no realtime clock
since successive boots could generate same output from RNG. The cpuset
bug is a possible information leak when reading from /dev/cpuset/tasks
(assuming cpusets support is compiled in and the cpuset fs mounted
on /dev/cpuset). The SCTP bug is remotely triggerable when using SCTP
conntrack."
For older kernels: 2.6.20.13 was released on
June 8 with the same security fixes; it was followed by 2.6.20.14 (June 11), which
contained a large assortment of patches.
2.4.34.5 was released on
June 6 with a small set of fixes. The 2.4.35 process continues with
2.4.35-pre5, also released on
the 6th.
Kernel development news
The overall quality of 2.6.21 is pretty horrific. It saw the
introduction of a lot of new code fundamental to the operation of
the kernel (the tickless stuff for eg), massive updates to areas
such as ACPI, and just to mix things up, we switched from a
known-crap-but-tried-and-tested IDE system to
a-bleeding-edge-but-hopefully-with-signs-of-promise libata based
system. Lots of changes == lots of fallout the first time it goes
into a production OS.
--
Dave Jones
What I am objecting to is this idea that many kernel developers
seem to have, that if there is some aspect of the kernel/user API
that becomes a bit inconvenient for the kernel to implement, then
we can put the blame on the applications that rely on that aspect,
call them names such as "legacy", "abuser", "conceptually buggy",
"broken", etc., and ultimately justify breaking the ABI -- since
it's only those applications that we have demonised that will be
affected, after all.
--
Paul Mackerras
/* I'm told there are only two stories in the world worth telling: love
* and hate. So there used to be a love scene here like this:
*
* Launcher: We could make beautiful I/O together, you and I.
* Guest: My, that's a big disk!
*
* Unfortunately, it was just too raunchy for our otherwise-gentle tale.
*/
--
Rusty Russell gets into
literate programming
For the curious, here's a recent posting from Linus Torvalds on Sun's
motivations and GPLv3:
"So to Sun, a GPLv3-only release would actually let them look good, and
still keep Linux from taking their interesting parts, and would allow them
to take at least parts of Linux without giving anything back (ahh, the
joys of license fragmentation).
Of course, they know that. And yes, maybe ZFS is worthwhile enough that
I'm willing to go to the effort of trying to relicense the kernel. But
quite frankly, I can almost guarantee that Sun won't release ZFS under the
GPLv3 even if they release other parts. Because if they did, they'd lose
the patent protection."
Support for ATI R500 graphics chipsets has been one of the biggest missing pieces from
the Linux free driver collection. That has just changed with the release
of an early driver for R500 chipsets written from reverse-engineered
specs. The driver only does 2D for now, but 3D support is in the works.
Unsurprisingly, the development team would like help in getting this driver
ready for production use. This release is an important step forward;
congratulations are due to the developers who have brought this work this
far.
The 2.6.22 kernel is getting closer to its final state with its official
release likely to happen near the end of this month. Patches are still
being added to the mainline repository, but things have stabilized enough
that it makes sense to take a look at where the code came from this time
around. Accordingly, your editor has fixed up his scripts and cranked
through the changesets added in this kernel development cycle.
As of this writing, just over 6,000 changesets have been accepted for
2.6.22. Those patches were contributed by 885 different developers, added
494,000 lines, and deleted 241,000 other lines (without counting renames,
which would otherwise increase both numbers by about 60,000 lines). That
makes 2.6.22 a large change relative to its immediate predecessors:
| Release | Developers | Changesets | Lines added | Lines removed |
| 2.6.20 | 741 | 4983 | 286,000 | 160,000 |
| 2.6.21 | 842 | 5349 | 343,000 | 199,000 |
| 2.6.22-rc4+ | 885 | 6093 | 494,000 | 241,000 |
Here are the top contributors of those changes:
| Most active 2.6.22 developers |
| By changesets |
| David S. Miller | 175 | 3.0% |
| Kristian Høgsberg | 109 | 1.9% |
| Stephen Hemminger | 86 | 1.5% |
| Arnaldo Carvalho de Melo | 82 | 1.4% |
| Andrew Morton | 79 | 1.3% |
| Stefan Richter | 79 | 1.3% |
| Christoph Lameter | 77 | 1.3% |
| Patrick McHardy | 76 | 1.3% |
| Jean Delvare | 75 | 1.3% |
| Dmitry Torokhov | 70 | 1.2% |
| Stephen Rothwell | 68 | 1.2% |
| Paul Mundt | 66 | 1.1% |
| David Brownell | 65 | 1.1% |
| Jeff Dike | 63 | 1.1% |
| Alan Cox | 60 | 1.0% |
| Andi Kleen | 59 | 1.0% |
| Antonino Daplas | 58 | 1.0% |
| Adrian Bunk | 58 | 1.0% |
| Tejun Heo | 57 | 1.0% |
| Russell King | 57 | 1.0% |
| By changed lines |
| Bryan Wu | 77594 | 12.9% |
| David Howells | 23310 | 3.9% |
| Marcelo Tosatti | 22351 | 3.7% |
| Patrick McHardy | 21746 | 3.6% |
| Jiri Benc | 18328 | 3.0% |
| Hans Verkuil | 13683 | 2.3% |
| David S. Miller | 13595 | 2.3% |
| Roland Dreier | 12247 | 2.0% |
| Artem B. Bityutskiy | 12065 | 2.0% |
| Kristian Høgsberg | 11153 | 1.9% |
| Robert P. J. Day | 7554 | 1.3% |
| Christoph Lameter | 7378 | 1.2% |
| Andrew Victor | 6638 | 1.1% |
| Mike Frysinger | 6313 | 1.0% |
| David Brownell | 6033 | 1.0% |
| Michael Chan | 5851 | 1.0% |
| Andi Kleen | 5431 | 0.9% |
| David Gibson | 5321 | 0.9% |
| Nobuhiro Iwamatsu | 5296 | 0.9% |
| Mark Fasheh | 4921 | 0.8% |
Bryan Wu makes it to the top of the list of contributors (by lines changed)
by virtue of being the person to contribute support for the Blackfin
architecture. David Howells contributed the AF_RXRPC and AFS filesystem
work; Marcelo Tosatti wrote the OLPC "Libertas" wireless driver, and Jiri
Benc's name appears on the mac80211 stack.
When broken down by employer, the (approximate, as always) numbers come out
like this:
| Most active 2.6.22 employers |
| By changesets |
| (Unknown) | 1766 | 30.2% |
| Red Hat | 720 | 12.3% |
| IBM | 601 | 10.3% |
| Novell | 411 | 7.0% |
| (None) | 245 | 4.2% |
| Intel | 203 | 3.5% |
| Oracle | 127 | 2.2% |
| (Consultant) | 119 | 2.0% |
| Linux Foundation | 116 | 2.0% |
| Google | 111 | 1.9% |
| SGI | 93 | 1.6% |
| Nokia | 83 | 1.4% |
| Freescale | 80 | 1.4% |
| Astaro | 76 | 1.3% |
| XenSource | 56 | 1.0% |
| MontaVista | 56 | 1.0% |
| Qumranet | 55 | 0.9% |
| HP | 53 | 0.9% |
| QLogic | 52 | 0.9% |
| Analog Devices | 49 | 0.8% |
| By lines changed |
| (Unknown) | 130164 | 21.6% |
| Red Hat | 104627 | 17.4% |
| Analog Devices | 84561 | 14.0% |
| Novell | 41366 | 6.9% |
| IBM | 33629 | 5.6% |
| Astaro | 22065 | 3.7% |
| (None) | 20097 | 3.3% |
| (Consultant) | 15403 | 2.6% |
| Linutronix | 13585 | 2.3% |
| Intel | 12288 | 2.0% |
| Cisco | 12280 | 2.0% |
| Oracle | 10482 | 1.7% |
| Freescale | 10116 | 1.7% |
| SGI | 8639 | 1.4% |
| Nokia | 7328 | 1.2% |
| SANPeople | 7045 | 1.2% |
| Broadcom | 5952 | 1.0% |
| MontaVista | 5810 | 1.0% |
| Linux Foundation | 5746 | 1.0% |
| Atmel | 5220 | 0.9% |
One thing which jumps out here is that the amount of code contributed by
developers known to be working on their own time has dropped; 2.6.22 will
be one of the most corporate kernels yet.
Looking at the developers who put Signed-off-by lines onto patches yields
some interesting results. If one tabulates all 12,678 signoffs in 2.6.22,
the results look like this:
| Developers with the most signoffs (total 12678) |
| Andrew Morton | 1415 | 11.2% |
| Linus Torvalds | 1299 | 10.2% |
| David S. Miller | 814 | 6.4% |
| Paul Mackerras | 381 | 3.0% |
| Jeff Garzik | 344 | 2.7% |
| Andi Kleen | 252 | 2.0% |
| Greg Kroah-Hartman | 236 | 1.9% |
| Mauro Carvalho Chehab | 236 | 1.9% |
| Stefan Richter | 210 | 1.7% |
| Russell King | 189 | 1.5% |
| James Bottomley | 176 | 1.4% |
| Jaroslav Kysela | 145 | 1.1% |
| Takashi Iwai | 131 | 1.0% |
| Len Brown | 126 | 1.0% |
| Kristian Høgsberg | 126 | 1.0% |
| Patrick McHardy | 117 | 0.9% |
| Jean Delvare | 110 | 0.9% |
| Roland Dreier | 109 | 0.9% |
| Antonino Daplas | 106 | 0.8% |
| Dmitry Torokhov | 105 | 0.8% |
All authors must sign off on their code. Additionally, any maintainer who
passes a patch up toward the mainline adds a signoff indicating that he or
she believes the code is legitimate and suitable for inclusion. If one
excludes signoffs by the author of each patch, the remaining 7,000 signoffs
are (almost) all by people through whom the code has passed (a few of them
are by additional authors of the patch). Those adding
non-author signoffs can thus be thought of as the gatekeepers through whom
each patch must pass. Non-author signoffs break down like this:
| Non-author signoffs (total 7028) |
| Andrew Morton | 1336 | 19.0% |
| Linus Torvalds | 1279 | 18.2% |
| David S. Miller | 640 | 9.1% |
| Paul Mackerras | 371 | 5.3% |
| Jeff Garzik | 322 | 4.6% |
| Greg Kroah-Hartman | 222 | 3.2% |
| Mauro Carvalho Chehab | 216 | 3.1% |
| Andi Kleen | 193 | 2.7% |
| James Bottomley | 163 | 2.3% |
| Jaroslav Kysela | 142 | 2.0% |
| Russell King | 132 | 1.9% |
| Stefan Richter | 131 | 1.9% |
| Len Brown | 115 | 1.6% |
| John W. Linville | 85 | 1.2% |
| Roland Dreier | 85 | 1.2% |
| Takashi Iwai | 79 | 1.1% |
| Martin Schwidefsky | 54 | 0.8% |
| David Woodhouse | 53 | 0.8% |
| Ralf Baechle | 48 | 0.7% |
| Antonino Daplas | 48 | 0.7% |
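The share of the gatekeeping work held by the twenty developers in the table above can be checked with a few lines of arithmetic. The sketch below simply sums the counts from that table and divides by the reported total; the function and array names are invented for this illustration and are not taken from the scripts used to generate the statistics.

```c
#include <assert.h>
#include <stddef.h>

/* Non-author signoff counts for the twenty developers in the table above. */
const int top20_signoffs[] = {
    1336, 1279, 640, 371, 322, 222, 216, 193, 163, 142,
    132, 131, 115, 85, 85, 79, 54, 53, 48, 48
};

/* Total number of non-author signoffs reported for 2.6.22. */
#define TOTAL_NON_AUTHOR_SIGNOFFS 7028

/* Sum an array of signoff counts. */
int sum_signoffs(const int *counts, size_t n)
{
    int total = 0;

    assert(counts != NULL);
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    return total;
}

/* Percentage of all non-author signoffs held by the listed developers. */
double top20_share_percent(void)
{
    size_t n = sizeof(top20_signoffs) / sizeof(top20_signoffs[0]);

    return 100.0 * sum_signoffs(top20_signoffs, n) / TOTAL_NON_AUTHOR_SIGNOFFS;
}
```

The listed counts add up to 5,714 of 7,028 non-author signoffs, or a little over 81%.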
In summary, 80% of the patches merged into the mainline kernel passed
through the twenty developers listed above. One can take another step, and
look at the number of non-author signoffs by employer:
| Non-author signoffs by employer |
| Google | 1338 | 19.0% |
| Linux Foundation | 1281 | 18.2% |
| Red Hat | 1246 | 17.7% |
| Novell | 700 | 10.0% |
| (Unknown) | 660 | 9.4% |
| IBM | 553 | 7.9% |
| (None) | 293 | 4.2% |
| Intel | 193 | 2.7% |
| SteelEye | 163 | 2.3% |
| Cisco | 85 | 1.2% |
| MIPS Technologies | 48 | 0.7% |
| Nokia | 42 | 0.6% |
| Astaro | 41 | 0.6% |
| Analog Devices | 35 | 0.5% |
| QLogic | 35 | 0.5% |
| Cendio | 32 | 0.5% |
| SGI | 28 | 0.4% |
| NetApp | 28 | 0.4% |
| (Consultant) | 23 | 0.3% |
| Oracle | 22 | 0.3% |
The bottom line: while Linux kernel development is a highly distributed
activity, the work of several hundred developers is channeled through a
surprisingly small number of individuals, and an even smaller number of
companies on its way into the mainline.
In last week's episode, the kernel developers were considering the
addition of a couple of flags to the open() system call; these flags
would allow applications to select previously unavailable features like
the non-sequential file descriptor range or immediate close-on-exec
behavior. The problem that comes up quickly is that open() is just one of
many system calls which create file descriptors; most of the others do
not have a parameter which allows an application to pass a set of
accompanying flags. So it is not possible to request, for example, the
non-sequential behavior when obtaining a file descriptor with socket(),
pipe(), epoll_create(), timerfd(), signalfd(), accept(), and so on.
In the second version of the non-sequential file descriptor patch, Davide
Libenzi attempted to address part of the problem by adding a socket2()
system call with an added "flags" parameter. That
was enough to frighten a number of developers; nobody really wants to see a
big expansion of the system call list resulting from the addition of
variations on all the file-descriptor-creating calls. Another approach, it
seems, is required, but finding that approach is not entirely easy.
One possibility is to simply ignore the problem; not everybody is sold on
the need for non-sequential file descriptors or immediate close-on-exec
behavior. There are enough people who see a problem here to motivate some
sort of solution, though. Ulrich Drepper, the glibc maintainer, has seen
enough applications to conclude that the issue is real.
An alternative, suggested by Alan Cox, is
to create a process state flag which controls the use of these features.
So a call like:
prctl(PR_SPARSEFD, 1);
would turn on non-sequential file descriptor allocation for all system
calls made by the calling process. The problem here is that the
lowest-available-descriptor behavior is a documented part of the POSIX
binary interface. A process could waive that guarantee for itself, but it
will always be hard to know that all libraries used by that process are
safe in the absence of that behavior. One library might want to use
non-sequential file descriptors, but that library cannot safely turn them
on for the whole process without risking the creation of difficult bugs in
obscure situations. It has been suggested that linker tricks could be used
to avoid bringing in older libraries, but Ulrich feels that people would
respond by simply recompiling the older libraries and the potential bugs
would remain.
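The guarantee at stake is easy to observe from user space. The following is a minimal sketch, assuming a single-threaded process and a readable /dev/null; the function name is made up for this example. It opens two descriptors, closes the lower one, and checks that the next open() hands the same number back, which is the behavior that older library code may silently depend on.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a freshly closed descriptor number is handed out again by
 * the next open(), 0 if it is not, and -1 if an open() call fails.
 * Assumes no other thread is creating descriptors at the same time.
 */
int lowest_fd_is_reused(void)
{
    int first, second, third, reused;

    first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;

    second = open("/dev/null", O_RDONLY);
    if (second < 0) {
        close(first);
        return -1;
    }
    /* "first" was the lowest free slot, so "second" must be above it. */
    assert(second > first);

    close(first);                       /* free the lower slot again */
    third = open("/dev/null", O_RDONLY);
    if (third < 0) {
        close(second);
        return -1;
    }

    reused = (third == first);
    close(second);
    close(third);
    return reused;
}
```

With today's lowest-available allocation the function returns 1. Under a process-wide non-sequential mode like the one discussed above it could return 0, and any library relying on the old behavior would misbehave without warning.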
Linus came into the discussion with a
statement that neither adding a bunch of new system calls nor the global
flag were acceptable. Instead, he came up with a completely different
idea: create a mechanism which allows a single system call to be invoked
with a specific set of flags. His proposed interface is:
int syscall_indirect(unsigned long flags, sigset_t sigmask,
int syscall, unsigned long args[6]);
The result would be a call to the given system call with the requested
arguments. For the duration of the call, the given flags would be
in effect, and signals in sigmask would be blocked. Even before
adding any flags, this mechanism could be used to implement the series of
system calls (pselect(), for example) which exists only to apply a
signal mask to an earlier version of the call. Then the non-sequential
file descriptor and close-on-exec behavior could be requested via the
flags argument. Beyond that, flags could be added to control the
handling of symbolic links, and various other things. Matt Mackall
suggested that the "syslet" mechanism could be implemented as a "run this
call asynchronously" flag.
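Since no implementation exists, the semantics described above can only be approximated. What follows is a rough user-space sketch, not the proposed kernel interface: it blocks the requested signals with sigprocmask(), issues the call through the C library's syscall() wrapper, and restores the old mask. The name syscall_indirect_sketch() is invented here, the signal mask is passed by pointer rather than by value, and since no flag bits have been defined yet, any non-zero flags value is rejected.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical user-space approximation of the proposed call.
 * Blocks the signals in *sigmask, runs system call "nr" with six
 * arguments, then restores the previous signal mask.
 * Returns the system call's result, or -1 with errno set.
 */
long syscall_indirect_sketch(unsigned long flags, const sigset_t *sigmask,
                             int nr, const unsigned long args[6])
{
    sigset_t saved;
    long ret;
    int saved_errno;

    assert(sigmask != NULL && args != NULL);

    if (flags != 0) {           /* no flag bits are defined in this sketch */
        errno = EINVAL;
        return -1;
    }

    if (sigprocmask(SIG_BLOCK, sigmask, &saved) < 0)
        return -1;

    ret = syscall(nr, args[0], args[1], args[2],
                  args[3], args[4], args[5]);
    saved_errno = errno;

    /* Put the caller's signal mask back, preserving the call's errno. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    errno = saved_errno;
    return ret;
}
```

A caller would fill in args[] and pass a number such as SYS_getpid. This emulation costs three system calls instead of one, and it cannot apply per-call flags inside the kernel; those are the things a real implementation would have to provide.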
This approach is not without its potential problems. There are worries
that the flags bits could be quickly exhausted, once again making
it hard to add options to existing system calls. Linus suggests overloading the flag bits as a way of
making them last longer. That approach risks problems if application
developers attempt to apply the wrong flags for a given system call - there would
be no automatic way of catching such errors - but it is unlikely that
applications would be calling syscall_indirect() themselves, so
this risk is relatively small. It is appropriate to worry about
whether any conceivable, sensible behavior modification is covered by this
interface, or whether it needs a different set of parameters. And one
might well wonder whether, some years from now, a large percentage of
system calls will be made via syscall_indirect().
This new system call suffers from one other shortcoming as well: there is
currently no working implementation. That will likely change at some
point, leading to a wider discussion of the proposed interface. If it
still seems like a good idea, we might just have a way of adding new
behavior to old functions without an explosion in the number of system
calls. Sometimes, perhaps, it really is true that problems in computer
science are best solved through the addition of another level of indirection.
June 12, 2007
This article was contributed by Valerie Henson
At this year's USENIX File
Systems and Storage Technology Conference, we were treated to two
papers studying failure rates in disk populations numbering over
100,000. These kinds of data sets are hard to get - first you have to
have 100,000 disks, then you have to record failure-related data
faithfully for years on end, and then you have to release the data in
a form that doesn't get anyone sued. The storage community has
salivated after this kind of real-world data for years, and now we
have not one, but two (!) long-term studies of disk failure rates. The
conference hall was packed during these two presentations. When the
talks were done, we stumbled out into the hallway, dazed and excited
by the many surprising results. Heat is negatively correlated with
failure! Failures show short AND long-term correlation! SMART errors
do mean the drive is more likely to fail, but a third of drives die
with no warning at all! The size of the data sets, the quality of
analysis, and the non-intuitive results win these two papers a place
on the Kernel Hacker's Bookshelf.
The first paper (and winner of Best Paper) was "Disk failures
in the real world: What does an MTTF of 1,000,000 hours mean to
you?", by Bianca Schroeder and Garth Gibson. They reviewed failure
data from a collection of 100,000 disks, over a period of up to 5
years. The disks were part of a variety of HPC clusters and an
Internet service provider. Disk failure was defined as the disk being
replaced. The date of replacement was also used as the date of the
failure, since determining exactly when a disk failed was not
possible.
Their first major result was that the real-world annualized failure
rate (average percentage of disks failing per year) was
much higher than the manufacturer's estimate - an
average of 3% vs. the estimated 0.5 - 0.9%. Disk manufacturers
obviously can't test disks for a year before shipping them, so they
stress test disks in high-temperature, high-vibration, high-workload
environments, and use data from previous models to estimate MTTF.
Only one set of disks had a real-world failure rate less than the
estimated failure rate, and one set of disks had a 13.5% annualized
failure rate!
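The relationship between a datasheet MTTF and an annualized failure rate can be worked out with a common approximation: divide the number of hours in a year by the MTTF. This assumes a constant failure rate and an MTTF much longer than one year; the helper names below are made up for this illustration.

```c
#include <assert.h>

#define HOURS_PER_YEAR 8760.0   /* 365 days * 24 hours */

/* Approximate annualized failure rate, in percent, for a given MTTF. */
double afr_percent_from_mttf(double mttf_hours)
{
    assert(mttf_hours > 0.0);
    return 100.0 * HOURS_PER_YEAR / mttf_hours;
}

/* Inverse: the MTTF, in hours, implied by an annualized failure rate. */
double mttf_hours_from_afr(double afr_percent)
{
    assert(afr_percent > 0.0);
    return 100.0 * HOURS_PER_YEAR / afr_percent;
}
```

An MTTF of 1,000,000 hours works out to an annualized failure rate of about 0.88%, at the top of the manufacturers' estimated range. Going the other way, the observed average of 3% implies an MTTF of roughly 292,000 hours.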
More surprisingly, they found no correlation between failure rate and
disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set
was composed of only SATA drives, which are commonly regarded as
less reliable than SCSI or Fibre Channel.
In another surprise, they debunked the "bathtub model" of disk failure
rates. In this theory, disks experience a higher "infant mortality"
initial rate of failure, then settle down for a few years of low
failure rate, and then begin to wear out and fail. The graph of the
probability vs. time looks like a bathtub, flat in the middle and
sloping up at the ends. Instead, the real-world failure rate began
low and steadily increased over the years. Disks don't have a sweet
spot of low failure rate.
Failures within a batch of disks were strongly correlated over both
short and long time periods. If a disk had failed in a batch, then
there was a significant probability of a second failure up to at least
2 years later. If one disk in your batch has just gone, you are more
likely to have another disk failure in the same batch. Scary news for
RAID arrays with disks from the same batch. A recent paper from the 2006
Storage Security and Survivability Workshop, "Using Device Diversity to
Protect Data against Batch-Correlated Disk Failures", by
Jehan-François Pâris and Darrell D. E. Long,
calculated the increase in RAID reliability from mixing batches of
disks. Using more than one kind of disk increases costs, but with the
combination of data from these two papers, RAID users can calculate
the value of the extra reliability and make the most economical
decision.
The second paper, "Failure Trends
in a Large Disk Drive Population", by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso, reports on disk
failure rates at Google. They used a Google tool for recording system
health parameters and many other staples of Google software
(Mapreduce, Bigtable, etc.) to collect and analyze the data. They
focused on SMART statistics - the built-in disk drive monitoring in
many modern disk drives, which records statistics about scan errors
and blocks relocated.
The first result agrees with the first paper: The annualized failure
rate was much higher than estimated, between 1.7% and 8.6%. They next
looked for correlation between failure rate and drive utilization (as
estimated by the amount of data read or written to the drive). They
found a much weaker correlation between higher utilization and failure
rate than expected, with low utilization disks often having higher
failure rates than medium utilization disks, and, in the case of the
3-year-old vintage of disks, higher than the high utilization group.
Now for the most surprising result. In Google's population of cheap
ATA disks, high temperature was negatively correlated
with failure! In the authors' words:
"In fact, there is a clear trend showing that lower temperatures are
associated with higher failure rates. Only at very high temperatures
is there a slight reversal of this trend."
This correlation held true over a temperature range of 17-55 C. Only
in the 3-year-old disk population was there a correlation between high
temperatures and failure rates. My completely unsupported and
untested hypothesis is that drive manufacturers stress test their
drives in high temperature environments to simulate longer wear.
Perhaps they have unwittingly designed drives that work better in
their high-temperature test environment at the expense of a more
typical low-temperature field environment.
Finally, they looked at the SMART data gathered from the drives.
Overall, any kind of SMART error correlated strongly with disk
failure. A scan error occurs when the disk checks data in the
background, reading the entire disk. Within 8 months of the first
scan error, about 30% of drives would fail completely. A reallocation
error occurs when a block can't be written, and the block is
reassigned to another location on disk. A reallocation error resulted
in about 15% of affected drives failing within 8 months. On the other
hand, 36% of the drives that failed had no warning whatsoever, either
from SMART errors or from exceptionally high temperatures.
For Google's purposes, the predictive power of SMART is of limited
utility. Replacing every disk that had a SMART error would end
up replacing good disks that will run for years to come about 70% of the
time. For Google, this isn't cost-effective, since all their data is
replicated several times. But for an individual user for whom losing
their disk is a disaster, replacing the disk at the first sign of a
SMART error makes eminent sense. I have personally had two laptop
drives start spitting SMART errors in time to get my data off the disk
before it died completely.
Overall, these are two exciting papers with long-awaited real-world
failure data on large disk populations. We should expect to see more
publications analyzing these data sets in the years to come.
Valerie Henson is a Linux file systems consultant specializing in file
system check and repair.
Page editor: Jonathan Corbet