
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.22-rc4. Patches continue to flow into the mainline repository; they are mostly fixes, but the ZERO_SIZE_PTR patch for the SLUB allocator has also gone in.

The current -mm tree is 2.6.22-rc4-mm2. Recent changes to -mm are almost all fixes aimed at stabilizing this tree somewhat.

The current stable 2.6 kernel is 2.6.21.5, released on June 11 with a rather long list of fixes. 2.6.21.4 was released on June 8 with a set of security fixes: "The /dev/[u]random fix is especially important for machines with no entropy source (e.g. keyboard, mice, or disk drives) and no realtime clock since successive boots could generate same output from RNG. The cpuset bug is a possible information leak when reading from /dev/cpuset/tasks (assuming cpusets support is compiled in and the cpuset fs mounted on /dev/cpuset). The SCTP bug is remotely triggerable when using SCTP conntrack."

For older kernels: 2.6.20.13 was released on June 8 with the same security fixes; it was followed by 2.6.20.14 (June 11), which contained a large assortment of patches.

2.4.34.5 was released on June 6 with a small set of fixes. The 2.4.35 process continues with 2.4.35-pre5, also released on the 6th.


Kernel development news

Quotes of the week

The overall quality of 2.6.21 is pretty horrific. It saw the introduction of a lot of new code fundamental to the operation of the kernel (the tickless stuff for eg), massive updates to areas such as ACPI, and just to mix things up, we switched from a known-crap-but-tried-and-tested IDE system to a-bleeding-edge-but-hopefully-with-signs-of-promise libata based system. Lots of changes == lots of fallout the first time it goes into a production OS.
-- Dave Jones

What I am objecting to is this idea that many kernel developers seem to have, that if there is some aspect of the kernel/user API that becomes a bit inconvenient for the kernel to implement, then we can put the blame on the applications that rely on that aspect, call them names such as "legacy", "abuser", "conceptually buggy", "broken", etc., and ultimately justify breaking the ABI -- since it's only those applications that we have demonised that will be affected, after all.
-- Paul Mackerras

/* I'm told there are only two stories in the world worth telling: love
 * and hate.  So there used to be a love scene here like this:
 *
 *  Launcher:	We could make beautiful I/O together, you and I.
 *  Guest:	My, that's a big disk!
 *
 * Unfortunately, it was just too raunchy for our otherwise-gentle tale.
 */
-- Rusty Russell gets into literate programming


Linus on GPLv3 and ZFS

For the curious, here's a recent posting from Linus Torvalds on Sun's motivations and GPLv3. "So to Sun, a GPLv3-only release would actually let them look good, and still keep Linux from taking their interesting parts, and would allow them to take at least parts of Linux without giving anything back (ahh, the joys of license fragmentation). Of course, they know that. And yes, maybe ZFS is worthwhile enough that I'm willing to go to the effort of trying to relicense the kernel. But quite frankly, I can almost guarantee that Sun won't release ZFS under the GPLv3 even if they release other parts. Because if they did, they'd lose the patent protection."


R500 initial driver release

Support for ATI R500 graphics chipsets has been one of the biggest missing pieces from the Linux free driver collection. That has just changed with the release of an early driver for R500 chipsets written from reverse-engineered specs. The driver only does 2D for now, but 3D support is in the works. Unsurprisingly, the development team would like help in getting this driver ready for production use. This release is an important step forward; congratulations are due to the developers who have brought this work this far.


Who wrote - and approved - 2.6.22

The 2.6.22 kernel is getting closer to its final state with its official release likely to happen near the end of this month. Patches are still being added to the mainline repository, but things have stabilized enough that it makes sense to take a look at where the code came from this time around. Accordingly, your editor has fixed up his scripts and cranked through the changesets added in this kernel development cycle.

As of this writing, just over 6,000 changesets have been accepted for 2.6.22. Those patches were contributed by 885 different developers, added 494,000 lines, and deleted 241,000 other lines (without counting renames, which would otherwise increase both numbers by about 60,000 lines). That makes 2.6.22 a large change relative to its immediate predecessors:

Release        Developers   Changesets   Lines added   Lines removed
2.6.20                741         4983       286,000         160,000
2.6.21                842         5349       343,000         199,000
2.6.22-rc4+           885         6093       494,000         241,000

Here are the top contributors of those changes:

Most active 2.6.22 developers

By changesets:
  David S. Miller              175   3.0%
  Kristian Høgsberg            109   1.9%
  Stephen Hemminger             86   1.5%
  Arnaldo Carvalho de Melo      82   1.4%
  Andrew Morton                 79   1.3%
  Stefan Richter                79   1.3%
  Christoph Lameter             77   1.3%
  Patrick McHardy               76   1.3%
  Jean Delvare                  75   1.3%
  Dmitry Torokhov               70   1.2%
  Stephen Rothwell              68   1.2%
  Paul Mundt                    66   1.1%
  David Brownell                65   1.1%
  Jeff Dike                     63   1.1%
  Alan Cox                      60   1.0%
  Andi Kleen                    59   1.0%
  Antonino Daplas               58   1.0%
  Adrian Bunk                   58   1.0%
  Tejun Heo                     57   1.0%
  Russell King                  57   1.0%

By changed lines:
  Bryan Wu                   77594  12.9%
  David Howells              23310   3.9%
  Marcelo Tosatti            22351   3.7%
  Patrick McHardy            21746   3.6%
  Jiri Benc                  18328   3.0%
  Hans Verkuil               13683   2.3%
  David S. Miller            13595   2.3%
  Roland Dreier              12247   2.0%
  Artem B. Bityutskiy        12065   2.0%
  Kristian Høgsberg          11153   1.9%
  Robert P. J. Day            7554   1.3%
  Christoph Lameter           7378   1.2%
  Andrew Victor               6638   1.1%
  Mike Frysinger              6313   1.0%
  David Brownell              6033   1.0%
  Michael Chan                5851   1.0%
  Andi Kleen                  5431   0.9%
  David Gibson                5321   0.9%
  Nobuhiro Iwamatsu           5296   0.9%
  Mark Fasheh                 4921   0.8%

Bryan Wu makes it to the top of the list of contributors (by lines changed) by virtue of being the person to contribute support for the Blackfin architecture. David Howells contributed the AF_RXRPC and AFS filesystem work; Marcelo Tosatti wrote the OLPC "Libertas" wireless driver, and Jiri Benc's name appears on the mac80211 stack.

When broken down by employer, the (approximate, as always) numbers come out like this:

Most active 2.6.22 employers

By changesets:
  (Unknown)            1766  30.2%
  Red Hat               720  12.3%
  IBM                   601  10.3%
  Novell                411   7.0%
  (None)                245   4.2%
  Intel                 203   3.5%
  Oracle                127   2.2%
  (Consultant)          119   2.0%
  Linux Foundation      116   2.0%
  Google                111   1.9%
  SGI                    93   1.6%
  Nokia                  83   1.4%
  Freescale              80   1.4%
  Astaro                 76   1.3%
  XenSource              56   1.0%
  MontaVista             56   1.0%
  Qumranet               55   0.9%
  HP                     53   0.9%
  QLogic                 52   0.9%
  Analog Devices         49   0.8%

By lines changed:
  (Unknown)          130164  21.6%
  Red Hat            104627  17.4%
  Analog Devices      84561  14.0%
  Novell              41366   6.9%
  IBM                 33629   5.6%
  Astaro              22065   3.7%
  (None)              20097   3.3%
  (Consultant)        15403   2.6%
  Linutronix          13585   2.3%
  Intel               12288   2.0%
  Cisco               12280   2.0%
  Oracle              10482   1.7%
  Freescale           10116   1.7%
  SGI                  8639   1.4%
  Nokia                7328   1.2%
  SANPeople            7045   1.2%
  Broadcom             5952   1.0%
  MontaVista           5810   1.0%
  Linux Foundation     5746   1.0%
  Atmel                5220   0.9%

One thing which jumps out here is that the amount of code contributed by developers known to be working on their own time has dropped; 2.6.22 will be one of the most corporate kernels yet.

Looking at the developers who put Signed-off-by lines onto patches yields some interesting results. If one tabulates all 12,678 signoffs in 2.6.22, the results look like this:

Developers with the most signoffs (total 12678):
  Andrew Morton            1415  11.2%
  Linus Torvalds           1299  10.2%
  David S. Miller           814   6.4%
  Paul Mackerras            381   3.0%
  Jeff Garzik               344   2.7%
  Andi Kleen                252   2.0%
  Greg Kroah-Hartman        236   1.9%
  Mauro Carvalho Chehab     236   1.9%
  Stefan Richter            210   1.7%
  Russell King              189   1.5%
  James Bottomley           176   1.4%
  Jaroslav Kysela           145   1.1%
  Takashi Iwai              131   1.0%
  Len Brown                 126   1.0%
  Kristian Høgsberg         126   1.0%
  Patrick McHardy           117   0.9%
  Jean Delvare              110   0.9%
  Roland Dreier             109   0.9%
  Antonino Daplas           106   0.8%
  Dmitry Torokhov           105   0.8%

All authors must sign off on their code. Additionally, any maintainer who passes a patch up toward the mainline adds a signoff indicating that he or she believes the code is legitimate and suitable for inclusion. If one excludes signoffs by the author of each patch, the remaining 7,000 signoffs are (almost) all by people through whom the code has passed (a few of them are by additional authors of the patch). Those adding non-author signoffs can thus be thought of as the gatekeepers through whom each patch must pass. Non-author signoffs break down like this:

Non-author signoffs (total 7028):
  Andrew Morton            1336  19.0%
  Linus Torvalds           1279  18.2%
  David S. Miller           640   9.1%
  Paul Mackerras            371   5.3%
  Jeff Garzik               322   4.6%
  Greg Kroah-Hartman        222   3.2%
  Mauro Carvalho Chehab     216   3.1%
  Andi Kleen                193   2.7%
  James Bottomley           163   2.3%
  Jaroslav Kysela           142   2.0%
  Russell King              132   1.9%
  Stefan Richter            131   1.9%
  Len Brown                 115   1.6%
  John W. Linville           85   1.2%
  Roland Dreier              85   1.2%
  Takashi Iwai               79   1.1%
  Martin Schwidefsky         54   0.8%
  David Woodhouse            53   0.8%
  Ralf Baechle               48   0.7%
  Antonino Daplas            48   0.7%

In summary, 80% of the patches merged into the mainline kernel passed through the twenty developers listed above. One can take another step, and look at the number of non-author signoffs by employer:

Non-author signoffs by employer:
  Google               1338  19.0%
  Linux Foundation     1281  18.2%
  Red Hat              1246  17.7%
  Novell                700  10.0%
  (Unknown)             660   9.4%
  IBM                   553   7.9%
  (None)                293   4.2%
  Intel                 193   2.7%
  SteelEye              163   2.3%
  Cisco                  85   1.2%
  MIPS Technologies      48   0.7%
  Nokia                  42   0.6%
  Astaro                 41   0.6%
  Analog Devices         35   0.5%
  QLogic                 35   0.5%
  Cendio                 32   0.5%
  SGI                    28   0.4%
  NetApp                 28   0.4%
  (Consultant)           23   0.3%
  Oracle                 22   0.3%

The bottom line: while Linux kernel development is a highly distributed activity, the work of several hundred developers is channeled through a surprisingly small number of individuals, and an even smaller number of companies on its way into the mainline.


More fun with file descriptors

In last week's episode, the kernel developers were considering the addition of a couple of flags to the open() system call; these flags would allow applications to select previously unavailable features like non-sequential file descriptor allocation or immediate close-on-exec behavior. The problem that comes up quickly is that open() is just one of many system calls which create file descriptors; most of the others do not have a parameter which allows an application to pass a set of accompanying flags. So it is not possible to request, for example, the non-sequential behavior when obtaining a file descriptor with socket(), pipe(), epoll_create(), timerfd(), signalfd(), accept(), and so on.

In the second version of the non-sequential file descriptor patch, Davide Libenzi attempted to address part of the problem by adding a socket2() system call with an added "flags" parameter. That was enough to frighten a number of developers; nobody really wants to see a big expansion of the system call list resulting from the addition of variations on all the file-descriptor-creating calls. Another approach, it seems, is required, but finding that approach is not entirely easy.

One possibility is to simply ignore the problem; not everybody is sold on the need for non-sequential file descriptors or immediate close-on-exec behavior. There are enough people who see a problem here to motivate some sort of solution, though. Ulrich Drepper, the glibc maintainer, has seen enough applications to conclude that the issue is real.

An alternative, suggested by Alan Cox, is to create a process state flag which controls the use of these features. So a call like:

    prctl(PR_SPARSEFD, 1);

would turn on non-sequential file descriptor allocation for all system calls made by the calling process. The problem here is that the lowest-available-descriptor behavior is a documented part of the POSIX binary interface. A process could waive that guarantee for itself, but it will always be hard to know that all libraries used by that process are safe in the absence of that behavior. One library might want to use non-sequential file descriptors, but that library cannot safely turn them on for the whole process without risking the creation of difficult bugs in obscure situations. It has been suggested that linker tricks could be used to avoid bringing in older libraries, but Ulrich feels that people would respond by simply recompiling the older libraries, and the potential bugs would remain.

Linus came into the discussion with a statement that neither adding a bunch of new system calls nor the global flag were acceptable. Instead, he came up with a completely different idea: create a mechanism which allows a single system call to be invoked with a specific set of flags. His proposed interface is:

    int syscall_indirect(unsigned long flags, sigset_t sigmask,
                         int syscall, unsigned long args[6]);

The result would be a call to the given system call with the requested arguments. For the duration of the call, the given flags would be in effect, and signals in sigmask would be blocked. Even before adding any flags, this mechanism could be used to implement the series of system calls (pselect(), for example) which exist only to apply a signal mask to an earlier version of the call. Then the non-sequential file descriptor and close-on-exec behavior could be requested via the flags argument. Beyond that, flags could be added to control the handling of symbolic links, and various other things. Matt Mackall suggested that the "syslet" mechanism could be implemented as a "run this call asynchronously" flag.

This approach is not without its potential problems. There are worries that the flags bits could be quickly exhausted, once again making it hard to add options to existing system calls. Linus suggests overloading the flag bits as a way of making them last longer. That approach risks problems if application developers attempt to apply the wrong flags for a given system call - there would be no automatic way of catching such errors - but it is unlikely that applications would be calling syscall_indirect() themselves, so this risk is relatively small. It is appropriate to worry about whether any conceivable, sensible behavior modification is covered by this interface, or whether it needs a different set of parameters. And one might well wonder whether, some years from now, a large percentage of system calls will be made via syscall_indirect().

This new system call suffers from one other shortcoming as well: there is currently no working implementation. That will likely change at some point, leading to a wider discussion of the proposed interface. If it still seems like a good idea, we might just have a way of adding new behavior to old functions without an explosion in the number of system calls. Sometimes, perhaps, it really is true that problems in computer science are best solved through the addition of another level of indirection.


KHB: Real-world disk failure rates: surprises, surprises, and more surprises

June 12, 2007

This article was contributed by Valerie Aurora

At this year's USENIX File Systems and Storage Technology Conference, we were treated to two papers studying failure rates in disk populations numbering over 100,000. These kinds of data sets are hard to get - first you have to have 100,000 disks, then you have to record failure-related data faithfully for years on end, and then you have to release the data in a form that doesn't get anyone sued. The storage community has salivated over this kind of real-world data for years, and now we have not one, but two (!) long-term studies of disk failure rates. The conference hall was packed during these two presentations. When the talks were done, we stumbled out into the hallway, dazed and excited by the many surprising results. Heat is negatively correlated with failure! Failures show short- AND long-term correlation! SMART errors do mean the drive is more likely to fail, but a third of drives die with no warning at all! The size of the data sets, the quality of analysis, and the non-intuitive results win these two papers a place on the Kernel Hacker's Bookshelf.

The first paper (and winner of the Best Paper award) was Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, by Bianca Schroeder and Garth Gibson. They reviewed failure data from a collection of 100,000 disks over a period of up to five years. The disks were deployed in a variety of HPC clusters and at an Internet service provider. Disk failure was defined as the disk being replaced. The date of replacement was also used as the date of the failure, since determining exactly when a disk failed was not possible.

Their first major result was that the real-world annualized failure rate (average percentage of disks failing per year) was much higher than the manufacturer's estimate - an average of 3% vs. the estimated 0.5 - 0.9%. Disk manufacturers obviously can't test disks for a year before shipping them, so they stress test disks in high-temperature, high-vibration, high-workload environments, and use data from previous models to estimate MTTF. Only one set of disks had a real-world failure rate less than the estimated failure rate, and one set of disks had a 13.5% annualized failure rate!

More surprisingly, they found no correlation between failure rate and disk type - SCSI, SATA, or Fibre Channel. The most reliable disk set was composed entirely of SATA drives, which are commonly regarded as less reliable than SCSI or Fibre Channel drives.

In another surprise, they debunked the "bathtub model" of disk failure rates. In this theory, disks experience a higher "infant mortality" initial rate of failure, then settle down for a few years of low failure rate, and then begin to wear out and fail. The graph of the probability vs. time looks like a bathtub, flat in the middle and sloping up at the ends. Instead, the real-world failure rate began low and steadily increased over the years. Disks don't have a sweet spot of low failure rate.

Failures within a batch of disks were strongly correlated over both short and long time periods. If a disk had failed in a batch, then there was a significant probability of a second failure up to at least 2 years later. If one disk in your batch has just gone, you are more likely to have another disk failure in the same batch. Scary news for RAID arrays with disks from the same batch. A recent paper in the 2006 Storage Security and Survivability Workshop, Using Device Diversity to Protect Data against Batch-Correlated Disk Failures, by Jehan-François Pâris and Darrell D. E. Long, calculated the increase in RAID reliability from mixing batches of disks. Using more than one kind of disk increases costs, but with the combination of data from these two papers, RAID users can calculate the value of the extra reliability and make the most economical decision.

The second paper, Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso, reports on disk failure rates at Google. They used a Google tool for recording system health parameters, along with many other staples of Google software (MapReduce, BigTable, etc.), to collect and analyze the data. They focused on SMART statistics - the monitoring built into many modern disk drives, which records statistics about scan errors and relocated blocks.

The first result agrees with the first paper: the annualized failure rate was much higher than estimated, between 1.7% and 8.6%. They next looked for a correlation between failure rate and drive utilization (as estimated by the amount of data read from or written to the drive). They found a much weaker correlation between higher utilization and failure rate than expected, with low-utilization disks often having higher failure rates than medium-utilization disks - and, in the case of the 3-year-old vintage of disks, higher than the high-utilization group.

Now for the most surprising result. In Google's population of cheap ATA disks, high temperature was negatively correlated with failure! In the authors' words:

In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

This correlation held true over a temperature range of 17-55 C. Only in the 3-year-old disk population was there correlation between high temperatures and failure rates. My completely unsupported and untested hypothesis is that drive manufacturers stress test their drives in high temperature environments to simulate longer wear. Perhaps they have unwittingly designed drives that work better in their high-temperature test environment at the expense of a more typical low-temperature field environment.

Finally, they looked at the SMART data gathered from the drives. Overall, any kind of SMART error correlated strongly with disk failure. A scan error occurs when the disk finds an error while checking data in the background by reading the entire disk; within 8 months of the first scan error, about 30% of drives would fail completely. A reallocation error occurs when a block can't be written and is reassigned to another location on the disk; a reallocation error resulted in about 15% of affected drives failing within 8 months. On the other hand, 36% of the drives that failed had no warning whatsoever, either from SMART errors or from exceptionally high temperatures.

For Google's purposes, the predictive power of SMART is of limited utility. Replacing every disk that had a SMART error would end up replacing good disks that will run for years to come about 70% of the time. For Google, this isn't cost-effective, since all their data is replicated several times. But for an individual user for whom losing their disk is a disaster, replacing the disk at the first sign of a SMART error makes eminent sense. I have personally had two laptop drives start spitting SMART errors in time to get my data off the disk before it died completely.

Overall, these are two exciting papers with long-awaited real-world failure data on large disk populations. We should expect to see more publications analyzing these data sets in the years to come.

Valerie Henson is a Linux file systems consultant specializing in file system check and repair.


Patches and updates

Kernel trees

Andrew Morton 2.6.22-rc4-mm2
Chris Wright Linux 2.6.21.5
Chris Wright Linux 2.6.21.4
Ingo Molnar v2.6.21.4-rt10
Ingo Molnar v2.6.21.4-rt11
Chris Wright Linux 2.6.20.14
Chris Wright Linux 2.6.20.13
Willy Tarreau Linux 2.4.35-pre5
Willy Tarreau Linux 2.4.34.5

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

clameter@sgi.com Slab defragmentation V3

Networking

andy-/Zus8d0mwwtBDgjK7y7TUQ@public.gmane.org Radiotap injection for Monitor Mode

Security-related

Toshiharu Harada TOMOYO Linux

Virtualization and containers

Miscellaneous

Rusty Russell struct list_node
Kay Sievers udev 112 release
Mark M. Hoffman New hwmon maintainer

Page editor: Jonathan Corbet


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds