Kernel development
Brief items
Kernel release status
The current 2.6 prepatch remains 2.6.22-rc4. Patches continue to flow into the mainline repository; they are mostly fixes, but the ZERO_SIZE_PTR patch for the SLUB allocator has also gone in.

The current -mm tree is 2.6.22-rc4-mm2. Recent changes to -mm are almost all fixes aimed at stabilizing this tree somewhat.
The current stable 2.6 kernel is 2.6.21.5, released on June 11 with a rather long list of fixes. 2.6.21.4 was released on June 8 with a set of security fixes: "The /dev/[u]random fix is especially important for machines with no entropy source (e.g. keyboard, mice, or disk drives) and no realtime clock since successive boots could generate same output from RNG. The cpuset bug is a possible information leak when reading from /dev/cpuset/tasks (assuming cpusets support is compiled in and the cpuset fs mounted on /dev/cpuset). The SCTP bug is remotely triggerable when using SCTP conntrack."
For older kernels: 2.6.20.13 was released on June 8 with the same security fixes; it was followed by 2.6.20.14 (June 11), which contained a large assortment of patches.
2.4.34.5 was released on June 6 with a small set of fixes. The 2.4.35 process continues with 2.4.35-pre5, also released on the 6th.
Kernel development news
Quotes of the week
/* I'm told there are only two stories in the world worth telling: love
 * and hate.  So there used to be a love scene here like this:
 *
 *    Launcher: We could make beautiful I/O together, you and I.
 *    Guest: My, that's a big disk!
 *
 * Unfortunately, it was just too raunchy for our otherwise-gentle tale. */
Linus on GPLv3 and ZFS
For the curious, here's a recent posting from Linus Torvalds on Sun's motivations and GPLv3. "So to Sun, a GPLv3-only release would actually let them look good, and still keep Linux from taking their interesting parts, and would allow them to take at least parts of Linux without giving anything back (ahh, the joys of license fragmentation). Of course, they know that. And yes, maybe ZFS is worthwhile enough that I'm willing to go to the effort of trying to relicense the kernel. But quite frankly, I can almost guarantee that Sun won't release ZFS under the GPLv3 even if they release other parts. Because if they did, they'd lose the patent protection."
R500 initial driver release
Support for ATI R500 graphics chipsets has been one of the biggest missing pieces from the Linux free driver collection. That has just changed with the release of an early driver for R500 chipsets written from reverse-engineered specs. The driver only does 2D for now, but 3D support is in the works. Unsurprisingly, the development team would like help in getting this driver ready for production use. This release is an important step forward; congratulations are due to the developers who have brought this work this far.

Who wrote - and approved - 2.6.22
The 2.6.22 kernel is getting closer to its final state, with its official release likely to happen near the end of this month. Patches are still being added to the mainline repository, but things have stabilized enough that it makes sense to take a look at where the code came from this time around. Accordingly, your editor has fixed up his scripts and cranked through the changesets added in this kernel development cycle.

As of this writing, just over 6,000 changesets have been accepted for 2.6.22. Those patches were contributed by 885 different developers, added 494,000 lines, and deleted 241,000 other lines (without counting renames, which would otherwise increase both numbers by about 60,000 lines). That makes 2.6.22 a large change relative to its immediate predecessors:
  Release       Developers   Changesets   Lines added   Lines removed
  2.6.20               741         4983       286,000         160,000
  2.6.21               842         5349       343,000         199,000
  2.6.22-rc4+          885         6093       494,000         241,000
Here are the top contributors of those changes:
Most active 2.6.22 developers
By changesets:

  David S. Miller             175   3.0%
  Kristian Høgsberg           109   1.9%
  Stephen Hemminger            86   1.5%
  Arnaldo Carvalho de Melo     82   1.4%
  Andrew Morton                79   1.3%
  Stefan Richter               79   1.3%
  Christoph Lameter            77   1.3%
  Patrick McHardy              76   1.3%
  Jean Delvare                 75   1.3%
  Dmitry Torokhov              70   1.2%
  Stephen Rothwell             68   1.2%
  Paul Mundt                   66   1.1%
  David Brownell               65   1.1%
  Jeff Dike                    63   1.1%
  Alan Cox                     60   1.0%
  Andi Kleen                   59   1.0%
  Antonino Daplas              58   1.0%
  Adrian Bunk                  58   1.0%
  Tejun Heo                    57   1.0%
  Russell King                 57   1.0%
By changed lines:

  Bryan Wu                  77594  12.9%
  David Howells             23310   3.9%
  Marcelo Tosatti           22351   3.7%
  Patrick McHardy           21746   3.6%
  Jiri Benc                 18328   3.0%
  Hans Verkuil              13683   2.3%
  David S. Miller           13595   2.3%
  Roland Dreier             12247   2.0%
  Artem B. Bityutskiy       12065   2.0%
  Kristian Høgsberg         11153   1.9%
  Robert P. J. Day           7554   1.3%
  Christoph Lameter          7378   1.2%
  Andrew Victor              6638   1.1%
  Mike Frysinger             6313   1.0%
  David Brownell             6033   1.0%
  Michael Chan               5851   1.0%
  Andi Kleen                 5431   0.9%
  David Gibson               5321   0.9%
  Nobuhiro Iwamatsu          5296   0.9%
  Mark Fasheh                4921   0.8%
Bryan Wu makes it to the top of the list of contributors (by lines changed) by virtue of being the person to contribute support for the Blackfin architecture. David Howells contributed the AF_RXRPC and AFS filesystem work; Marcelo Tosatti wrote the OLPC "Libertas" wireless driver, and Jiri Benc's name appears on the mac80211 stack.
When broken down by employer, the (approximate, as always) numbers come out like this:
Most active 2.6.22 employers
By changesets:

  (Unknown)           1766  30.2%
  Red Hat              720  12.3%
  IBM                  601  10.3%
  Novell               411   7.0%
  (None)               245   4.2%
  Intel                203   3.5%
  Oracle               127   2.2%
  (Consultant)         119   2.0%
  Linux Foundation     116   2.0%
                       111   1.9%
  SGI                   93   1.6%
  Nokia                 83   1.4%
  Freescale             80   1.4%
  Astaro                76   1.3%
  XenSource             56   1.0%
  MontaVista            56   1.0%
  Qumranet              55   0.9%
  HP                    53   0.9%
  QLogic                52   0.9%
  Analog Devices        49   0.8%
By lines changed:

  (Unknown)          130164  21.6%
  Red Hat            104627  17.4%
  Analog Devices      84561  14.0%
  Novell              41366   6.9%
  IBM                 33629   5.6%
  Astaro              22065   3.7%
  (None)              20097   3.3%
  (Consultant)        15403   2.6%
  Linutronix          13585   2.3%
  Intel               12288   2.0%
  Cisco               12280   2.0%
  Oracle              10482   1.7%
  Freescale           10116   1.7%
  SGI                  8639   1.4%
  Nokia                7328   1.2%
  SANPeople            7045   1.2%
  Broadcom             5952   1.0%
  MontaVista           5810   1.0%
  Linux Foundation     5746   1.0%
  Atmel                5220   0.9%
One thing which jumps out here is that the amount of code contributed by developers known to be working on their own time has dropped; 2.6.22 will be one of the most corporate kernels yet.
Looking at the developers who put Signed-off-by lines onto patches yields some interesting results. If one tabulates all 12,678 signoffs in 2.6.22, the results look like this:
Developers with the most signoffs (total 12678):

  Andrew Morton            1415  11.2%
  Linus Torvalds           1299  10.2%
  David S. Miller           814   6.4%
  Paul Mackerras            381   3.0%
  Jeff Garzik               344   2.7%
  Andi Kleen                252   2.0%
  Greg Kroah-Hartman        236   1.9%
  Mauro Carvalho Chehab     236   1.9%
  Stefan Richter            210   1.7%
  Russell King              189   1.5%
  James Bottomley           176   1.4%
  Jaroslav Kysela           145   1.1%
  Takashi Iwai              131   1.0%
  Len Brown                 126   1.0%
  Kristian Høgsberg         126   1.0%
  Patrick McHardy           117   0.9%
  Jean Delvare              110   0.9%
  Roland Dreier             109   0.9%
  Antonino Daplas           106   0.8%
  Dmitry Torokhov           105   0.8%
All authors must sign off on their code. Additionally, any maintainer who passes a patch up toward the mainline adds a signoff indicating that he or she believes the code is legitimate and suitable for inclusion. If one excludes signoffs by the author of each patch, the remaining 7,000 signoffs are (almost) all by people through whom the code has passed (a few of them are by additional authors of the patch). Those adding non-author signoffs can thus be thought of as the gatekeepers through whom each patch must pass. Non-author signoffs break down like this:
Non-author signoffs (total 7028):

  Andrew Morton            1336  19.0%
  Linus Torvalds           1279  18.2%
  David S. Miller           640   9.1%
  Paul Mackerras            371   5.3%
  Jeff Garzik               322   4.6%
  Greg Kroah-Hartman        222   3.2%
  Mauro Carvalho Chehab     216   3.1%
  Andi Kleen                193   2.7%
  James Bottomley           163   2.3%
  Jaroslav Kysela           142   2.0%
  Russell King              132   1.9%
  Stefan Richter            131   1.9%
  Len Brown                 115   1.6%
  John W. Linville           85   1.2%
  Roland Dreier              85   1.2%
  Takashi Iwai               79   1.1%
  Martin Schwidefsky         54   0.8%
  David Woodhouse            53   0.8%
  Ralf Baechle               48   0.7%
  Antonino Daplas            48   0.7%
In summary, 80% of the patches merged into the mainline kernel passed through the twenty developers listed above. One can take another step, and look at the number of non-author signoffs by employer:
Non-author signoffs by employer:

                        1338  19.0%
  Linux Foundation      1281  18.2%
  Red Hat               1246  17.7%
  Novell                 700  10.0%
  (Unknown)              660   9.4%
  IBM                    553   7.9%
  (None)                 293   4.2%
  Intel                  193   2.7%
  SteelEye               163   2.3%
  Cisco                   85   1.2%
  MIPS Technologies       48   0.7%
  Nokia                   42   0.6%
  Astaro                  41   0.6%
  Analog Devices          35   0.5%
  QLogic                  35   0.5%
  Cendio                  32   0.5%
  SGI                     28   0.4%
  NetApp                  28   0.4%
  (Consultant)            23   0.3%
  Oracle                  22   0.3%
The bottom line: while Linux kernel development is a highly distributed activity, the work of several hundred developers is channeled through a surprisingly small number of individuals, and an even smaller number of companies on its way into the mainline.
More fun with file descriptors
In last week's episode, the kernel developers were considering the addition of a couple of flags to the open() system call; these flags would allow applications to select previously unavailable features like the non-sequential file descriptor range or immediate close-on-exec behavior. The problem that comes up quickly is that open() is just one of many system calls which create file descriptors; most of the others do not have a parameter which allows an application to pass a set of accompanying flags. So it is not possible to request, for example, the non-sequential behavior when obtaining a file descriptor with socket(), pipe(), epoll_create(), timerfd(), signalfd(), accept(), and so on.

In the second version of the non-sequential file descriptor patch, Davide Libenzi attempted to address part of the problem by adding a socket2() system call with an added "flags" parameter. That was enough to frighten a number of developers; nobody really wants to see a big expansion of the system call list resulting from the addition of variations on all the file-descriptor-creating calls. Another approach, it seems, is required, but finding that approach is not entirely easy.
One possibility is to simply ignore the problem; not everybody is sold on the need for non-sequential file descriptors or immediate close-on-exec behavior. There are enough people who see a problem here to motivate some sort of solution, though. Ulrich Drepper, the glibc maintainer, has seen enough applications to conclude that the issue is real.
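The close-on-exec case illustrates why a flag at creation time matters: with current interfaces, a program must create the descriptor and then set FD_CLOEXEC in a second step, leaving a window in which another thread could fork() and exec() with the descriptor still open. A minimal sketch of today's two-step dance (function name mine, /dev/null as a stand-in file):

```c
#include <fcntl.h>
#include <unistd.h>

/* Today's racy two-step: create a descriptor, then mark it
 * close-on-exec.  Between the two calls, a concurrent fork()+exec()
 * in another thread would leak the descriptor to the new program. */
int cloexec_two_step(void)
{
	int fd = open("/dev/null", O_RDONLY);
	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {  /* race window closes here */
		close(fd);
		return -1;
	}
	int flags = fcntl(fd, F_GETFD);            /* verify the flag stuck */
	close(fd);
	return (flags & FD_CLOEXEC) ? 0 : -1;
}
```

An O_CLOEXEC-style flag at open() time would collapse the two steps into one atomic operation; the difficulty under discussion is providing the same atomicity for all the other descriptor-creating calls.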
An alternative, suggested by Alan Cox, is to create a process state flag which controls the use of these features. So a call like:
prctl(PR_SPARSEFD, 1);
would turn on non-sequential file descriptor allocation for all system calls made by the calling process. The problem here is that the lowest-available-descriptor behavior is a documented part of the POSIX binary interface. A process could waive that guarantee for itself, but it will always be hard to know that all libraries used by that process are safe in the absence of that behavior. One library might want to use non-sequential file descriptors, but that library cannot safely turn them on for the whole process without risking the creation of difficult bugs in obscure situations. It has been suggested that linker tricks could be used to avoid bringing in older libraries, but Ulrich feels that people would respond by simply recompiling the older libraries and the potential bugs would remain.
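The guarantee in question is easy to demonstrate: POSIX requires open() to return the lowest-numbered descriptor not currently open in the process, so closing a low descriptor causes the next open() to reuse its number. A minimal illustration (function name mine):

```c
#include <fcntl.h>
#include <unistd.h>

/* POSIX requires open() to return the lowest-numbered free
 * descriptor, so freeing a low descriptor means the next open()
 * reuses it -- exactly the behavior a PR_SPARSEFD-style flag
 * would waive for the whole process. */
int lowest_fd_is_reused(void)
{
	int a = open("/dev/null", O_RDONLY);
	int b = open("/dev/null", O_RDONLY);

	close(a);                             /* free the lower number */
	int c = open("/dev/null", O_RDONLY);  /* must come back as 'a' */

	int reused = (c == a);
	close(b);
	close(c);
	return reused;
}
```

Applications (and libraries) written against this guarantee are precisely what makes a process-wide opt-out dangerous.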
Linus came into the discussion with a statement that neither adding a bunch of new system calls nor the global flag was acceptable. Instead, he came up with a completely different idea: create a mechanism which allows a single system call to be invoked with a specific set of flags. His proposed interface is:
int syscall_indirect(unsigned long flags, sigset_t sigmask,
int syscall, unsigned long args[6]);
The result would be a call to the given system call with the requested arguments. For the duration of the call, the given flags would be in effect, and signals in sigmask would be blocked. Even before adding any flags, this mechanism could be used to implement the series of system calls (pselect(), for example) which exists only to apply a signal mask to an earlier version of the call. Then the non-sequential file descriptor and close-on-exec behavior could be requested via the flags argument. Beyond that, flags could be added to control the handling of symbolic links, and various other things. Matt Mackall suggested that the "syslet" mechanism could be implemented as a "run this call asynchronously" flag.
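pselect() is a good illustration of the pattern syscall_indirect() would generalize: it exists only to apply a signal mask atomically around a select(). A minimal call (function name mine) with no descriptors, a zero timeout, and SIGUSR1 blocked just for the duration:

```c
#include <sys/select.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* pselect() exists only to install a signal mask atomically for the
 * duration of a select() -- the same job the sigmask argument of the
 * proposed syscall_indirect() would do for every system call. */
int pselect_masked_poll(void)
{
	sigset_t mask;
	sigemptyset(&mask);
	sigaddset(&mask, SIGUSR1);           /* blocked only during the call */

	struct timespec timeout = { 0, 0 };  /* return immediately */

	/* No descriptors to watch; returns 0 when the timeout expires. */
	return pselect(0, NULL, NULL, NULL, &timeout, &mask);
}
```

With syscall_indirect(), the kernel would need only one such wrapper mechanism rather than a sigmask-augmented twin for each affected call.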
This approach is not without its potential problems. There are worries that the flags bits could be quickly exhausted, once again making it hard to add options to existing system calls. Linus suggests overloading the flag bits as a way of making them last longer. That approach risks problems if application developers attempt to apply the wrong flags for a given system call - there would be no automatic way of catching such errors - but it is unlikely that applications would be calling syscall_indirect() themselves, so this risk is relatively small. It is appropriate to worry about whether any conceivable, sensible behavior modification is covered by this interface, or whether it needs a different set of parameters. And one might well wonder whether, some years from now, a large percentage of system calls will be made via syscall_indirect().
This new system call suffers from one other shortcoming as well: there is currently no working implementation. That will likely change at some point, leading to a wider discussion of the proposed interface. If it still seems like a good idea, we might just have a way of adding new behavior to old functions without an explosion in the number of system calls. Sometimes, perhaps, it really is true that problems in computer science are best solved through the addition of another level of indirection.
KHB: Real-world disk failure rates: surprises, surprises, and more surprises
At this year's USENIX File Systems and Storage Technology Conference, we were treated to two papers studying failure rates in disk populations numbering over 100,000. These kinds of data sets are hard to get - first you have to have 100,000 disks, then you have to record failure-related data faithfully for years on end, and then you have to release the data in a form that doesn't get anyone sued. The storage community has salivated after this kind of real-world data for years, and now we have not one, but two (!) long-term studies of disk failure rates. The conference hall was packed during these two presentations. When the talks were done, we stumbled out into the hallway, dazed and excited by the many surprising results. Heat is negatively correlated with failure! Failures show short AND long-term correlation! SMART errors do mean the drive is more likely to fail, but a third of drives die with no warning at all! The size of the data sets, the quality of analysis, and the non-intuitive results win these two papers a place on the Kernel Hacker's Bookshelf.
The first paper (and winner of Best Paper), was Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, by Bianca Schroeder and Garth Gibson. They reviewed failure data from a collection of 100,000 disks, over a period of up to 5 years. The disks were part of a variety of HPC clusters and an Internet service provider. Disk failure was defined as the disk being replaced. The date of replacement was also used as the date of the failure, since determining exactly when a disk failed was not possible.
Their first major result was that the real-world annualized failure rate (average percentage of disks failing per year) was much higher than the manufacturer's estimate - an average of 3% vs. the estimated 0.5 - 0.9%. Disk manufacturers obviously can't test disks for a year before shipping them, so they stress test disks in high-temperature, high-vibration, high-workload environments, and use data from previous models to estimate MTTF. Only one set of disks had a real-world failure rate less than the estimated failure rate, and one set of disks had a 13.5% annualized failure rate!
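The arithmetic behind that comparison is worth spelling out: for small failure rates, an annualized failure rate can be approximated as hours-per-year divided by the MTTF, so a vendor MTTF of 1,000,000 hours corresponds to roughly 0.9% per year, several times lower than the observed 3% average. A quick sketch of the conversion (function name mine):

```c
/* First-order conversion from vendor MTTF to annualized failure
 * rate: hours per year / MTTF.  A rough approximation, valid when
 * the rate is small; the exact exponential form 1 - exp(-8760/MTTF)
 * differs only slightly at these magnitudes. */
double mttf_to_afr(double mttf_hours)
{
	return 8760.0 / mttf_hours;
}
```

mttf_to_afr(1000000.0) gives about 0.0088, i.e. just under 0.9% per year, matching the top of the manufacturers' estimated range.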
More surprisingly, they found no correlation between failure rate and disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set was composed of only SATA drives, which are commonly regarded as less reliable than SCSI or Fibre Channel.
In another surprise, they debunked the "bathtub model" of disk failure rates. In this theory, disks experience a higher "infant mortality" initial rate of failure, then settle down for a few years of low failure rate, and then begin to wear out and fail. The graph of the probability vs. time looks like a bathtub, flat in the middle and sloping up at the ends. Instead, the real-world failure rate began low and steadily increased over the years. Disks don't have a sweet spot of low failure rate.
Failures within a batch of disks were strongly correlated over both short and long time periods. If a disk had failed in a batch, then there was a significant probability of a second failure up to at least 2 years later. If one disk in your batch has just gone, you are more likely to have another disk failure in the same batch. Scary news for RAID arrays with disks from the same batch. A recent paper in the 2006 Storage Security and Survivability Workshop, Using Device Diversity to Protect Data against Batch-Correlated Disk Failures, by Jehan-François Pâris and Darrell D. E. Long, calculated the increase in RAID reliability from mixing batches of disks. Using more than one kind of disk increases costs, but with the combination of data from these two papers, RAID users can calculate the value of the extra reliability and make the most economical decision.
The second paper, Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, reports on disk failure rates at Google. They used a Google tool for recording system health parameters and many other staples of Google software (MapReduce, Bigtable, etc.) to collect and analyze the data. They focused on SMART statistics - the built-in disk drive monitoring in many modern disk drives, which records statistics about scan errors and relocated blocks.
The first result agrees with the first paper: the annualized failure rate was much higher than estimated, between 1.7% and 8.6%. They next looked for correlation between failure rate and drive utilization (as estimated by the amount of data read from or written to the drive). They found a much weaker correlation between higher utilization and failure rate than expected, with low-utilization disks often having higher failure rates than medium-utilization disks and, in the case of the three-year-old vintage of disks, higher than the high-utilization group.
Now for the most surprising result: in Google's population of cheap ATA disks, high temperature was negatively correlated with failure!
This correlation held true over a temperature range of 17-55 C. Only in the 3-year-old disk population was there correlation between high temperatures and failure rates. My completely unsupported and untested hypothesis is that drive manufacturers stress test their drives in high temperature environments to simulate longer wear. Perhaps they have unwittingly designed drives that work better in their high-temperature test environment at the expense of a more typical low-temperature field environment.
Finally, they looked at the SMART data gathered from the drives. Overall, any kind of SMART error correlated strongly with disk failure. A scan error occurs when the disk checks data in the background, reading the entire disk. Within 8 months of the first scan error, about 30% of drives would fail completely. A reallocation error occurs when a block can't be written, and the block is reassigned to another location on disk. A reallocation error resulted in about 15% of affected drives failing within 8 months. On the other hand, 36% of the drives that failed had no warning whatsoever, either from SMART errors or from exceptionally high temperatures.
For Google's purposes, the predictive power of SMART is of limited utility: about 70% of the time, replacing a disk at its first SMART error would mean replacing a good disk that would have run for years to come. For Google, this isn't cost-effective, since all their data is replicated several times. But for an individual user, for whom losing a disk is a disaster, replacing the disk at the first sign of a SMART error makes eminent sense. I have personally had two laptop drives start spitting SMART errors in time to get my data off before they died completely.
Overall, these are two exciting papers with long-awaited real-world failure data on large disk populations. We should expect to see more publications analyzing these data sets in the years to come.
Valerie Henson is a Linux file systems consultant specializing in file system check and repair.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet