Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.31-rc7, released on August 21. "But apart from a couple of bigger ones (OMAP GPIO/UART fixes and the radeon/kms changes), it's really pretty small. The bulk of those 290 files changed are basically few-liners in 213 commits (shortlog below), and in general we should have cut down the regression list another tiny bit." The short-format changelog is in the announcement, along with some other descriptions of changes and areas that need testing.
The current unresolved regression count stands at 26, out of a total of 108 reported.
Kernel development news
Quotes of the week
+	if (iommu->cap == (uint64_t)-1 && iommu->ecap == (uint64_t)-1) {
+		/* Promote an attitude of violence to a BIOS engineer today */
In brief
What is direct I/O, really? Linux, like many operating systems, supports direct I/O operations to block devices. But how, exactly, should programmers expect direct I/O to work? As a recent document posted by Ted Ts'o notes, there is no real specification for what direct I/O means.
Ted's document is an attempt to better specify what is really going on when a process requests a direct I/O operation. It is currently focused on the ext4 filesystem, but the hope is to forge a consensus among Linux filesystem developers so that consistent semantics can be obtained on all filesystems.
Can you thaw out TuxOnIce? TuxOnIce is the perennially out-of-tree hibernation implementation. It has a number of nice features which are not available with the mainstream version, but these features have never managed to get into a form where they could be merged. TuxOnIce developer Nigel Cunningham has recently concluded that this merger is not going to happen, because the relevant people are simply too busy.
In response, he is now actively looking for developers who would like to take on the task of getting TuxOnIce (or, at least, parts of it) into the mainline. He has put together a "todo" list for potentially interested parties.
Lazy workqueues. Kernel developers have been concerned for years that the number of kernel threads was growing beyond reason; see, for example, this article from 2007. Jens Axboe recently became concerned himself when he noticed that his system (a modest 64-processor box) had 531 kernel threads running on it. Enough, he decided, was enough.
His response was the lazy workqueue concept. As might be expected, this patch is an extension of the workqueue mechanism. A "lazy" workqueue can be created with create_lazy_workqueue(); it will be established with a single worker thread. Unlike single-threaded workqueues, though, lazy workqueues still try to preserve the concept of dedicated, per-CPU worker threads. Whenever a task is submitted to a lazy workqueue, the kernel will direct it toward the thread running on the submitting CPU; if no such thread exists, the kernel will create it. These threads will exit if they are idle for a sufficient period.
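Usage would presumably mirror the existing workqueue API. The following sketch assumes the create_lazy_workqueue() helper from Jens's posting; the queue and function names here are invented for illustration:

/* Sketch only: assumes the lazy workqueue API from Jens's patch. */
static struct workqueue_struct *crunch_wq;
static struct work_struct crunch_work;

static void crunch_fn(struct work_struct *work)
{
	/* Runs on a per-CPU worker thread created on demand; the
	 * thread exits again after a sufficient period of idleness. */
}

static int __init crunch_init(void)
{
	crunch_wq = create_lazy_workqueue("crunch");
	if (!crunch_wq)
		return -ENOMEM;
	INIT_WORK(&crunch_work, crunch_fn);
	queue_work(crunch_wq, &crunch_work);	/* thread spawned here if needed */
	return 0;
}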
The end result was a halving of the number of kernel threads on Jens's system. That still seems like too many threads, but it's a good step in the right direction.
Embedded x86. Thomas Gleixner started his patch series with a note that the "embedded nightmare" has finally come to the x86 architecture. The key development here is a new set of patches intended to support Intel's new "Moorestown" processor series; these patches added a bunch of code to deal with the new quirks in this processor. Rather than further clutter the x86 architecture code, Thomas decided that it was time for a major cleanup.
The result is a new, global platform_setup structure designed to tell the architecture code how to set up the current processor. It includes a set of function pointers which handle platform-specific tasks like locating BIOS ROMs, setting up interrupt handling, initializing clocks, and much more; the series runs to 32 parts in all. This new structure is able to encapsulate many of the initialization-time differences between the 32-bit and 64-bit x86 architectures, the new "Moorestown" architecture, and various virtualized variants as well. It is also runtime-configurable, so a single kernel should be able to run efficiently on any of the supported systems.
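The shape of the structure is roughly as follows; this is an illustrative sketch only, and the field names are assumptions rather than the actual layout from Thomas's series:

/* Illustrative only: the real structure has many more hooks, and
 * these member names are invented for the example. */
struct platform_setup {
	void (*probe_roms)(void);	/* locate BIOS ROMs, if any */
	void (*init_irq)(void);		/* interrupt controller setup */
	void (*time_init)(void);	/* clocks and timers */
};

Each platform (PC, Moorestown, a virtualized guest, ...) fills in its own instance, and the generic setup code simply calls through the pointers.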
O_NOSTD. Longstanding Unix practice dictates that applications are started with the standard input, output, and error I/O streams on file descriptors 0, 1, and 2, respectively. The assumption that these file descriptors will be properly set up is so strong that most developers never think to check them. So interesting things can happen if an application is run with one or more of the standard file descriptors closed.
Consider, for example, running a program with file descriptor 2 closed. The next file the program opens will be assigned that descriptor. If something then causes the program to write to (what it thinks is) the standard error stream, that output will, instead, go to the other file which had been opened, probably corrupting that file. A malicious user can easily make messes this way; when setuid programs are involved, the potential consequences are worse.
There are a number of ways to avoid falling into this trap. An application can, on startup, ensure that the first three file descriptors are open. Or it can check the returned file descriptor from open() calls and use dup() to change the descriptor if need be. But these options are expensive, especially considering that, almost all of the time, the standard file descriptors are set up just as they should be.
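The dup()-based workaround looks something like this sketch, which uses fcntl() with F_DUPFD to push the descriptor above the standard range:

/* One common pattern: if open() hands back a "standard" descriptor,
 * move it above the 0..2 range with F_DUPFD. */
#include <fcntl.h>
#include <unistd.h>

int open_nonstd(const char *path, int flags)
{
	int fd = open(path, flags);
	if (fd >= 0 && fd <= 2) {
		int newfd = fcntl(fd, F_DUPFD, 3);	/* lowest free fd >= 3 */
		close(fd);
		fd = newfd;
	}
	return fd;
}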
Eric Blake has proposed a new alternative in the form of the O_NOSTD flag. The semantics are simple: if this flag is provided to an open() call, the kernel will not return one of the "standard" file descriptors. If this patch goes in (and there does not seem to be any opposition to that), application developers will be able to use it to ensure that they are not getting any file descriptor surprises without additional runtime cost.
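Should the flag go in, usage would presumably be as simple as adding it to the flags argument; this is a hypothetical example, since O_NOSTD exists only in the proposal:

/* Hypothetical: O_NOSTD is the proposed flag, not yet in any header. */
int fd = open("/var/log/app.log", O_WRONLY | O_CREAT | O_NOSTD, 0644);
/* fd is guaranteed to be 3 or higher, even if 0, 1, or 2 are closed. */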
There is a cost, of course, in the form of a non-standard flag that will not be supported on all platforms. One could almost argue that it would be better to add a specific flag for cases where a file descriptor in the [0..2] range is desired. But that would be a major ABI change to say the least; it's not an idea that would be well received.
Linux-ARM mailing lists. Russell King has announced that the ARM-related mailing lists on arm.linux.kernel.org will be shut down immediately. He is, it seems, not happy about some of the criticism he has received about the operation of those lists. So the lists will be moving, though exactly where is not entirely clear. David Woodhouse has created a new set of lists on infradead; he appears to have moved the subscriber lists over as well. There is also a push to move the list traffic to vger, but the preservation of the full set of lists and their subscribers suggests that the infradead lists are the ones which will actually get used.
Page-based direct I/O
An "address space" in kernel jargon is a mapping between a range of addresses and their representation in an underlying filesystem or device. There is an address space associated with every open file; any given address space may or may not be tied to a virtual memory area in a process's virtual (memory) address space. In a typical process, a number of address spaces will exist for mappings of the executable being run, files the process has open, and ranges of anonymous user memory (which use swap as their backing store). There are a number of ways for processes to operate on their address spaces, one of the stranger of which being direct I/O. A new patch series from Jens Axboe looks to rationalize the direct I/O path a bit, making it more flexible in the process.The idea behind direct I/O is that data blocks move directly between the storage device and user-space memory without going through the page cache. Developers use direct memory for either (or both) of two reasons: (1) they believe they can manage caching of file contents better than the kernel can, or (2) they want to avoid overflowing the page cache with data which is unlikely to be of use in the near future. It is a relatively little-used feature which is often combined with another obscure kernel capability: asynchronous I/O. The biggest consumers, by far, of this functionality are large relational database systems, so it is not entirely surprising that a developer currently employed by Oracle is working in this area.
When the kernel needs to do something with an address space, it usually looks into the associated address_space_operations structure for an appropriate function. So, for example, normal file I/O operations are handled with:
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *filp, struct page *page);
As with the bulk of low-level, memory-oriented kernel operations, these functions operate on page structures. When memory is managed at this level, there is little need to worry about whether it is user-space or kernel memory, or whether it is in the high-memory zone. It's all just memory. The function which handles direct I/O looks a little different, though:
ssize_t (*direct_IO)(int rw, struct kiocb *iocb, const struct iovec *iov,
		     loff_t offset, unsigned long nr_segs);
The use of the kiocb structure shows the assumption that direct I/O will be submitted through the asynchronous I/O path. Beyond that, though, the iovec structure pointing to the buffers to be transferred comes directly from user space, and it contains user-space addresses. That, in turn, implies that the direct_IO() function must itself deal with the process of getting access to the user-space buffers. That task is generally handled in VFS-layer generic code, but there's another problem: the direct_IO() function cannot be called on kernel memory.
The kernel does not normally need to use the direct I/O paths itself, but there is one exception: the loopback driver. This driver allows an ordinary file to be mounted as if it were a block device; it can be most useful for accessing filesystem images stored within disk files. But files accessed via a loopback mount may well be represented in the page cache twice: once on each side of the loopback mount. The result is a waste of memory which could probably be put to better uses.
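For reference, such a loopback binding is set up through the loop driver's ioctl interface; a minimal user-space sketch (the device name and image file are examples, and error handling is minimal):

/* Sketch: bind an ordinary file to /dev/loop0 with LOOP_SET_FD -
 * the file's pages may now end up cached twice. */
#include <fcntl.h>
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int filefd = open("fs.img", O_RDWR);
	int loopfd = open("/dev/loop0", O_RDWR);

	if (filefd < 0 || loopfd < 0)
		return 1;
	if (ioctl(loopfd, LOOP_SET_FD, filefd) < 0)
		return 1;
	/* /dev/loop0 can now be mounted like any block device */
	return 0;
}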
It would, in summary, be nice to change the direct_IO() interface to avoid this memory waste, and to make it a little bit more consistent with the other address space operations. That is what Jens's patch does. With that patch, the interface becomes:
struct dio_args {
	int rw;
	struct page **pages;
	unsigned int first_page_off;
	unsigned long nr_segs;
	unsigned long length;
	loff_t offset;

	/*
	 * Original user pointer, we'll get rid of this
	 */
	unsigned long user_addr;
};

ssize_t (*direct_IO)(struct kiocb *iocb, struct dio_args *args);
In the new API, many of the relevant parameters have been grouped into the dio_args structure. The memory to be transferred can be found by way of the pages array. The higher-level VFS direct I/O code now handles the task of mapping user-space buffers and creating that array.
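That mapping step, now done once in generic code, looks roughly like the following; this is a simplified sketch of the get_user_pages() call as it existed in the 2.6.31 era, with error and partial-pin handling omitted:

/* Simplified sketch: pin the pages behind a user-space buffer so
 * that lower layers can work with struct page pointers. */
static int dio_map_user_buffer(unsigned long user_addr, int nr_pages,
			       int rw, struct page **pages)
{
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm,
			     user_addr & PAGE_MASK, nr_pages,
			     rw == READ,	/* pages are written on a read */
			     0, pages, NULL);
	up_read(&current->mm->mmap_sem);
	return ret;
}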
The impact on the code is, for the most part, small; it's mostly a matter of moving the location where the translation from user-space address to page structures is done. The current code does have a potential problem in that it only processes one I/O segment at a time, possibly creating performance problems for some kinds of applications. That mode of operation is not really wired into the system, though, and can presumably be fixed at some point.
The only other objection came from Andrew Morton, who does not like the way Jens implemented the process of working through the array of page structures. The index into this array (called head_page) is built into struct dio and hidden from the code which is actually working through the pages; that leads to potential confusion, especially if the operation aborts partway through. Andrew called it "a disaster waiting to happen" and recommended that indexing be made explicit where the pages array is processed.
That is a detail, though - albeit a potentially important one. The core goals and implementation appear to have been received fairly well. It seems highly unlikely that this code could be ready for the 2.6.32 merge window, but we might see it aiming for the mainline in a subsequent development cycle.
Development statistics for 2.6.31
The Linux Foundation recently announced the release of an updated version of its kernel authorship report, co-written by LWN editor Jonathan Corbet. The information there is interesting, but, since it stops with the 2.6.30 kernel, it is also ancient history at this point; 2.6.30 came out two full months ago, after all. LWN readers, certainly, are used to more current information. Since 2.6.31 is getting close to ready, it seems like the right time to look at this development cycle and see where the code came from.

As of this writing (just after the release of 2.6.31-rc7), the 2.6.31 development cycle had seen the incorporation of 10,663 non-merge changesets from 1,146 individual developers. These patches added almost 903,000 lines of code and removed just over 494,000 lines, for a net growth of just over 408,000 lines. According to Rafael Wysocki's August 25 report, this work introduced 108 regressions into the kernel, 26 of which still lack a resolution.
The largest individual contributors in the 2.6.31 development cycle were:
Most active 2.6.31 developers
By changesets:

  Ingo Molnar                  276   2.6%
  Peter Zijlstra               260   2.4%
  Paul Mundt                   204   1.9%
  Takashi Iwai                 150   1.4%
  Bartlomiej Zolnierkiewicz    149   1.4%
  Steven Rostedt               139   1.3%
  Tejun Heo                    134   1.3%
  Johannes Berg                133   1.2%
  Magnus Damm                  119   1.1%
  Mike Frysinger               115   1.1%
  roel kluin                   105   1.0%
  Greg Kroah-Hartman           101   0.9%
  Erik Andrén                  100   0.9%
  Paul Mackerras                85   0.8%
  Mark Brown                    85   0.8%
  Bill Pemberton                82   0.8%
  Jaswinder Singh Rajput        79   0.7%
  Ben Dooks                     72   0.7%
  Joe Perches                   72   0.7%
  Alexander Beregalov           71   0.7%

By changed lines:

  Bartlomiej Zolnierkiewicz 220749  18.3%
  Jerry Chuang               78441   6.5%
  Forest Bond                50834   4.2%
  David Daney                40052   3.3%
  Jerome Glisse              38604   3.2%
  Vlad Zolotarov             23260   1.9%
  Ingo Molnar                22614   1.9%
  James Smart                19209   1.6%
  Bill Pemberton             17249   1.4%
  dmitry pervushin           14532   1.2%
  Greg Kroah-Hartman         13234   1.1%
  Wai Yew CHAY               12741   1.1%
  Michael Chan               11887   1.0%
  Linus Walleij              11626   1.0%
  Paul Mundt                 10735   0.9%
  Peter Zijlstra             10202   0.8%
  Zhu Yi                     10197   0.8%
  Ben Dooks                  10150   0.8%
  Johannes Berg               9532   0.8%
  Kalle Valo                  9263   0.8%
Ingo Molnar always shows up near the top of the changeset statistics. He has, as usual, contributed work all over the core kernel and x86 architecture code, but the bulk of his work this time is in the performance counters code; most of Peter Zijlstra's contributions were also in this area. The merging of this fast-changing subsystem caused those two developers to be responsible for 5% of the patches going into the 2.6.31 release. Paul Mundt wrote a vast number of Super-H architecture patches, and Takashi Iwai contributed large numbers of ALSA patches.
#5 on the changesets list is Bartlomiej Zolnierkiewicz, who also comes out on top in terms of the number of lines changed. He contributed a few IDE patches, despite having handed off responsibility for that subsystem, but most of his work went into cleaning up the Ralink wireless drivers in the staging tree. This cleanup resulted in the removal of an amazing 208,000 lines of code. Jerry Chuang added the RealTek RTL8192SU wireless driver (to staging), Forest Bond added the VIA Technologies VT6655 driver (also to staging), David Daney did a bunch of MIPS work (including adding the Octeon Ethernet driver to the staging tree), and Jerome Glisse added kernel mode setting support for Radeon graphics chipsets.
As we have seen in the past few development cycles, the staging tree is the source of much of the change in the kernel tree. The nature of that change is, itself, changing, though. The rush of adding out-of-tree drivers to the staging tree has slowed considerably; we're starting to see more work dedicated to fixing up the code which is already there.
The developers contributing to 2.6.31 were supported by a minimum of 194 employers. The most active of those were:
Most active 2.6.31 employers
By changesets:

  (None)                      1704  16.0%
  Red Hat                     1587  14.9%
  Intel                        878   8.2%
  (Unknown)                    846   7.9%
  IBM                          667   6.3%
  Novell                       614   5.8%
  Renesas Technology           345   3.2%
  Fujitsu                      223   2.1%
  (Consultant)                 212   2.0%
  Analog Devices               212   2.0%
  Oracle                       175   1.6%
  Nokia                        131   1.2%
  AMD                          129   1.2%
  Atheros Communications       118   1.1%
  MontaVista                   104   1.0%
  Xelerated AB                 100   0.9%
  (Academia)                    92   0.9%
  NetApp                        91   0.9%
  HP                            86   0.8%
  Wolfson Microelectronics      85   0.8%

By lines changed:

  (None)                    311803  25.8%
  Red Hat                   124831  10.3%
  Realtek                    78441   6.5%
  Intel                      62559   5.2%
  Broadcom                   51806   4.3%
  Logic Supply               51401   4.3%
  (Unknown)                  47165   3.9%
  Cavium Networks            40086   3.3%
  IBM                        39991   3.3%
  Novell                     31979   2.6%
  Renesas Technology         31674   2.6%
  (Consultant)               23659   2.0%
  Emulex                     19209   1.6%
  University of Virginia     17607   1.5%
  Nokia                      16234   1.3%
  Embedded Alley Solutions   15229   1.3%
  Creative Technology        12741   1.1%
  Oracle                     11704   1.0%
  Analog Devices             10760   0.9%
  Texas Instruments          10639   0.9%
The top group in either category is developers working on their own time, followed by Red Hat, which merged a few large chunks of code this time around.
A look at non-author signoffs (a hint as to which subsystem maintainers admitted the patches into the mainline) shows a continuation of recent trends:
Top non-author signoffs in 2.6.31
Individuals:

  David S. Miller            964  10.1%
  Ingo Molnar                948   9.9%
  Greg Kroah-Hartman         582   6.1%
  John W. Linville           575   6.0%
  Andrew Morton              569   6.0%
  Mauro Carvalho Chehab      535   5.6%
  Linus Torvalds             254   2.7%
  James Bottomley            237   2.5%
  Benny Halevy               191   2.0%
  Paul Mundt                 159   1.7%

Employers:

  Red Hat                   3686  38.7%
  Novell                    1061  11.1%
  Intel                      829   8.7%
  Google                     572   6.0%
  (None)                     422   4.4%
  IBM                        383   4.0%
  Linux Foundation           254   2.7%
  Oracle                     228   2.4%
  Panasas                    193   2.0%
  (Consultant)               168   1.8%
49.8% of the patches going into the mainline for 2.6.31 passed through the hands of developers working for just two companies: Red Hat and Novell. Linux kernel developers work for a large number of companies, but subsystem maintainers are increasingly concentrated in a very small number of places.
In summary, it is a fairly typical development cycle for the kernel in recent times. The number of changes is high (but not a record), as is the number of developers. The transient effect of the staging tree is beginning to fade; it is becoming just another path for drivers heading into the mainline. As a whole, the process seems to be functioning in a smooth and robust manner.
(As always, your editor would like to thank Greg Kroah-Hartman for his assistance in the preparation of these statistics.)
HWPOISON
One downside to the ever-increasing memory sizes in contemporary computers is an increase in memory failures; as memory density increases, error rates rise as well. To offset this increased error rate, recent processors have included support for "poisoned" memory, an adaptive method for flagging and recovering from memory errors. The HWPOISON patch, recently developed by Andi Kleen and Fengguang Wu, provides Linux kernel support for memory poisoning. When HWPOISON is coupled with the appropriate fault-tolerant processors, Linux users can enjoy systems that are more tolerant of memory errors in spite of increased memory densities.
Memory errors are classified as either soft (transient) or hard (permanent). In soft errors, cosmic rays or other random events can toggle the state of a bit in an SRAM or DRAM memory cell. In hard errors, memory cells become physically degraded. Hardware can detect - and automatically correct - some of these errors via Error Correcting Codes (ECC). While single-bit data errors can be corrected via ECC, multi-bit data errors cannot. For these uncorrectable errors, the hardware typically generates a trap which, in turn, causes a kernel panic.
Crashing the machine in response to every uncorrected soft or hard memory error is often an overreaction. If the detected memory error never actually corrupts executing software, then ignoring or isolating the error is the most desirable action. Memory "poisoning", with its delayed handling of errors, allows uncorrected memory errors to be isolated and recovered from more gracefully than simply crashing the system. Memory poisoning requires both hardware and kernel support, however.
The HWPOISON patch is very timely: Intel's recent preview of its Xeon processor (codenamed Nehalem-EX) promises support for memory poisoning. Intel has included its Machine Check Abort (MCA) Recovery architecture in Nehalem-EX; originally developed for ia64 processors, the MCA Recovery architecture supports memory poisoning and various other hardware failure recovery mechanisms. While HWPOISON adopted Intel's usage of the term "poisoning", it should not be confused with the unrelated Linux kernel concept of poisoning: writing a pattern to memory to catch uses of uninitialized memory.
While the specifics of how hardware and the kernel might implement memory poisoning varies, the general concept is as follows. First, hardware detects an uncorrectable error from memory transfers into the system cache or on the system bus. Alternatively, memory may be occasionally "scrubbed." That is, a background process may initiate an ECC check on one or more memory pages. In either case, the hardware doesn't immediately cause a machine check but rather flags the data unit as poisoned until read (or consumed). Later, when erroneous data is read by executing software, a machine check is initiated. If the erroneous data is never read, no machine check is necessary. For example, a modified cache line written back to main memory may have a data word error that is marked as poisoned. Once the poisoned data is actually used (loaded into a processor register, etc.), a machine check occurs, but not before. Thus, any poisoning machine check event may happen long after the corresponding data error event.
HWPOISON is a poisoned-data handler invoked by the low-level Linux machine check code. Where possible, HWPOISON attempts to gracefully recover from memory errors and to contain the faulty memory so that it causes no further trouble. At first glance, an obvious solution for the poison handler would focus on the specific process and memory address(es) associated with the data error. That is infeasible for two reasons, however. First, the offending instruction and process cannot be determined, due to delays between the consumption of the bad data and the execution of the poison handler; these delays include asynchronous hardware reporting of the machine check event and deferred execution of the handler via a workqueue, so a different process may be running by the time the HWPOISON handler is ready to act. Second, bad-memory containment must be done at a level where the kernel actually manages memory. Thus, HWPOISON works at page granularity, rather than at the finer granularity supported by Intel's MCA Recovery hardware.
HWPOISON finds the page containing the poisoned data and attempts to isolate this page from further use. Potentially corrupted processes can then be located by finding all processes that have the corrupted page mapped. HWPOISON performs a variety of different actions. Its exact behavior depends upon the type of corrupted page and various kernel configuration parameters.
To enable the HWPOISON handler, the kernel configuration parameter MEMORY_FAILURE must be set. Otherwise, hardware poisoning will cause a system panic. Additionally, the architecture must support data poisoning. As of this writing, HWPOISON is enabled for all architectures to make testing on any machine possible via a user-mode fault injector, which is detailed below.
The handler must allow for multiple poisoning events occurring in a short time window. HWPOISON uses a bit in the flags field of a struct page to mark and lock a page as poisoned. Since page flags are currently in short supply, this choice was not made without consternation and debate by kernel hackers. See this LWN article for further details about this issue. In any case, this bit allows previously poisoned pages to be ignored by the handler.
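In code, that mark-and-lock step presumably reduces to an atomic test-and-set on the new flag; here is a sketch, assuming a TestSetPageHWPoison() accessor in the style of the kernel's other page-flag macros:

/* Sketch: atomically set the poison bit; if it was already set, a
 * previous event claimed this page and this call backs off. */
if (TestSetPageHWPoison(p)) {
	printk(KERN_ERR "MCE %#lx: page already hardware poisoned\n", pfn);
	return;
}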
The handler ignores the following types of pages: 1) pages that have already been poisoned, 2) pages that are outside of kernel control (an invalid page frame number), 3) reserved kernel pages, and 4) pages with a usage count of zero, which implies either a free page or a higher-order kernel page. The poisoned bit in the flags field serves as a lock, so that rapid-fire poisoning machine checks on the same page are handled only once; subsequent calls to the handler are simply ignored. Ignoring reserved kernel pages and zero-count pages brings the peril of a later system panic, but these pages, which may contain critical kernel data, cannot be isolated, so HWPOISON has no useful options for recovery there.
In addition to ignoring pages, the possible HWPOISON actions are recovery, delay, and failure. Recovery means that HWPOISON took action to isolate a page. Ignore, failure, and delay are all similar in that the page is not isolated beyond being flagged as poisoned. With delay, handling can be safely postponed until a later time, when the page might be referenced; by then, some transient errors may not recur or may have become irrelevant. HWPOISON delays action on kernel slab pages and on free pages in the buddy allocator. Failure means that HWPOISON does not support handling the page; unknown page types and huge pages receive this treatment. Huge pages fail because reverse mapping, which is needed to identify the processes owning the page, is not supported for them.
Clean pages in either the swap cache or the page cache are easily recovered: since a duplicate copy exists in backing store, the in-memory copy can simply be invalidated. Dirty pages, instead, differ from their on-disk copies, so a poisoned dirty page may represent the corruption of important data. Dirty page cache pages are still recovered by invalidating the cache entry, but an error is also set on the page so that subsequent system calls on the associated file will return an I/O error. Dirty pages in the swap cache are handled in a delayed fashion: the dirty flag is cleared and the swap cache entry is maintained; on a later page fault, the associated application will be killed.
To recover from poisoned, user-mapped pages, HWPOISON first finds all user processes which have the corrupted page mapped. For clean pages with backing store, no process needs to be killed, so HWPOISON need take no further recovery action. Dirty pages are unmapped from all associated processes, which may then be killed. Two VM sysctl parameters control this killing: vm.memory_failure_early_kill and vm.memory_failure_recovery. Setting vm.memory_failure_early_kill causes an immediate SIGBUS to be sent to the affected process(es); the kill is done using a catchable SIGBUS with an si_code of BUS_MCEERR_AO, so processes can decide for themselves how to handle the data poisoning. With early kill disabled, the killing is delayed: the page is merely unmapped by HWPOISON, and a SIGBUS is sent only if the unmapped page is actually referenced at some later time. The vm.memory_failure_recovery parameter, when set to zero, disables this recovery altogether in favor of a straightforward panic.
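An application which wants to handle poisoning itself might install a handler along these lines; this is a hedged user-space sketch, assuming an early-kill SIGBUS carrying BUS_MCEERR_AO:

/* Sketch: catch the "action optional" SIGBUS sent on early kill.
 * A real application would try to rebuild or drop the affected
 * data (the failing address is in si->si_addr) instead of exiting. */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void poison_handler(int sig, siginfo_t *si, void *ctx)
{
	if (si->si_code == BUS_MCEERR_AO) {
		static const char msg[] = "memory poisoned, exiting\n";
		write(STDERR_FILENO, msg, sizeof(msg) - 1);
	}
	_exit(1);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = poison_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);
	pause();	/* wait for something bad to happen */
	return 0;
}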
An HWPOISON patch git repository is available at
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison
Since faulty hardware that supports data poisoning is not easy to come by, a fault injection test harness mm/hwpoison-inject.c has also been developed. This simple harness uses debugfs to allow failures at an arbitrary page to be injected.
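Driving the harness might look like the following sketch; the debugfs file name ("corrupt-pfn") and its location are assumptions about the interface, so consult mm/hwpoison-inject.c for the real details:

/* Hypothetical injection helper: write a page frame number into the
 * debugfs file exported by mm/hwpoison-inject.c. The path below is
 * an assumption; debugfs must be mounted at /sys/kernel/debug. */
#include <stdio.h>

int main(int argc, char **argv)
{
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
		return 1;
	}
	f = fopen("/sys/kernel/debug/hwpoison/corrupt-pfn", "w");
	if (!f) {
		perror("corrupt-pfn");
		return 1;
	}
	fprintf(f, "%s\n", argv[1]);	/* poison this page frame */
	fclose(f);
	return 0;
}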
While HWPOISON was developed for x86-based machines, interest has been expressed by supporters of other Linux server architectures, such as ia64 and sparc (discussed here). Thus, the patch may proliferate on future Linux server distributions, allowing users of future Linux servers to enjoy increased fault tolerance. Now that Intel is supporting MCA Recovery on x86 machines, some desktop users may also enjoy its benefits in the near future.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet