Brief items
The current development kernel is 2.6.31-rc7,
released on August 21.
"But apart from a couple of bigger ones (OMAP GPIO/UART fixes and the
radeon/kms changes), it's really pretty small. The bulk of those 290 files
changed are basically few-liners in 213 commits (shortlog below), and in
general we should have cut down the regression list another tiny
bit." The short-format changelog is in the announcement, along
with some other descriptions of changes and areas that need testing.
The current unresolved regression
count stands at 26, out of a total reported of 108.
Kernel development news
Many kernel developers believe that userspace is burned into ROM
and the only thing they can change is the kernel. That turns out
to be incorrect.
--
Avi Kivity
+ if (iommu->cap == (uint64_t)-1 && iommu->ecap == (uint64_t)-1) {
+ /* Promote an attitude of violence to a BIOS engineer today */
--
David Woodhouse
You don't get a consistent filesystem with ext2, either. And if
your claim is that several hundred lines of fsck output detailing
the filesystem's destruction somehow makes things all better, I
suspect most users would disagree with you.
--
Ted Ts'o
I recommend a sledgehammer.
If you want to lose your data, you might as well have some fun.
--
Rik van Riel
By Jonathan Corbet
August 26, 2009
What is direct I/O, really? Linux, like many operating systems,
supports direct I/O operations to block devices. But how, exactly, should
programmers expect direct I/O to work? As a recent document posted by
Ted Ts'o notes, there is no real specification for what direct I/O means:
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour
has generally been passed down as oral lore rather than as a formal
set of requirements and specifications.
Ted's document is an attempt to better specify what is really going on when
a process requests a direct I/O operation. It is currently focused on the
ext4 filesystem, but the hope is to forge a consensus among Linux
filesystem developers so that consistent semantics can be obtained on all
filesystems.
Can you thaw out TuxOnIce? TuxOnIce is the perennially out-of-tree
hibernation implementation. It has a number of nice features which are
not available with the mainstream version; these features have never
managed to get into a form where they could be merged. TuxOnIce developer
Nigel Cunningham has recently concluded
that it looks like this merger is not going to happen because the relevant
people are simply too busy. He says:
Given that this has been the outcome so far, I see no reason to
imagine that we're going to make any serious progress any time
soon.
In response, he is now actively looking for developers who would like to
take on the task of getting TuxOnIce (or, at least, parts of it) into the
mainline. He has put together a "todo"
list for potentially interested parties.
Lazy workqueues. Kernel developers have been concerned for years
that the number of kernel threads was growing beyond reason; see, for
example, this article from
2007. Jens Axboe recently became concerned himself when he noticed that
his system (a modest 64-processor box) had 531 kernel threads running on
it. Enough, he decided, was enough.
His response was the lazy
workqueue concept. As might be expected, this patch is an extension of
the workqueue mechanism. A "lazy" workqueue can be created with
create_lazy_workqueue(); it will be established with a single
worker thread. Unlike single-threaded workqueues, though, lazy workqueues
still try to preserve the concept of dedicated, per-CPU worker threads.
Whenever a task is submitted to a lazy workqueue, the kernel will direct it
toward the thread running on the submitting CPU; if no such thread exists,
the kernel will create it. These threads will exit if they are idle for a
sufficient period.
The end result was a halving of the number of kernel threads on Jens's
system. That still seems like too many threads, but it's a good step in
the right direction.
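The create-on-demand, exit-when-idle behavior described above can be modeled in a few lines of user-space C. This is only a sketch of the idea, not code from the patch: the structure, the function names, the CPU count, the idle limit, and the assumption that the initial worker (placed here on CPU 0) never exits are all hypothetical.

```c
#include <stdbool.h>

#define NR_CPUS    4   /* assumed CPU count for the model */
#define IDLE_LIMIT 3   /* assumed idle ticks before a worker exits */

/* Hypothetical model of a lazy workqueue: one optional worker per CPU. */
struct lazy_wq {
    bool     alive[NR_CPUS];
    unsigned idle[NR_CPUS];
};

/* A lazy workqueue starts life with a single worker (CPU 0 in this model). */
void lazy_wq_init(struct lazy_wq *wq)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        wq->alive[cpu] = (cpu == 0);
        wq->idle[cpu] = 0;
    }
}

/* Submit work from the given CPU; returns true if a worker had to be created. */
bool lazy_wq_queue(struct lazy_wq *wq, int cpu)
{
    bool created = !wq->alive[cpu];

    wq->alive[cpu] = true;
    wq->idle[cpu] = 0;
    return created;
}

/* One idle period passes; workers idle for IDLE_LIMIT ticks exit. */
void lazy_wq_tick(struct lazy_wq *wq)
{
    for (int cpu = 1; cpu < NR_CPUS; cpu++)
        if (wq->alive[cpu] && ++wq->idle[cpu] >= IDLE_LIMIT)
            wq->alive[cpu] = false;
}

/* Count the workers that currently exist. */
int lazy_wq_threads(const struct lazy_wq *wq)
{
    int n = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        n += wq->alive[cpu];
    return n;
}
```

In this model, a queue that only ever sees work from one or two CPUs never pays for worker threads on the others, which is where the reduction in thread count comes from.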
Embedded x86. Thomas Gleixner started his patch series with a note
that the "embedded nightmare" has finally come to the x86 architecture.
The key development here is a new set of patches intended to support
Intel's new "Moorestown" processor series; these patches added a bunch of
code to deal with the new quirks in this processor. Rather than further
clutter the x86 architecture code, Thomas decided that it was time for a
major cleanup.
The result is a new, global platform_setup structure designed to
tell the architecture code how to set up the current processor. It
includes a set of function pointers which handle platform-specific tasks
like locating BIOS ROMs, setting up interrupt handling, initializing
clocks, and much more; it is a 32-part patch in all. This new structure is
able to encapsulate many of the initialization-time differences between the
32-bit and 64-bit x86 architectures, the new "Moorestown" architecture, and
various virtualized variants as well. It is also runtime-configurable, so
a single kernel should be able to run efficiently on any of the supported
systems.
O_NOSTD. Longstanding Unix practice dictates that applications are
started with the standard input, output, and error I/O streams on file
descriptors 0, 1, and 2, respectively. The assumption that these file
descriptors will be properly set up is so strong that most developers never think to
check them. So interesting things can happen if an application is run with
one or more of the standard file descriptors closed.
Consider, for example, running a program with file
descriptor 2 closed. The next file the program opens will be assigned that
descriptor. If something then causes the program to write to (what it
thinks is) the standard error stream, that output will, instead, go to the
other file which had been opened, probably corrupting that file. A
malicious user can easily make messes this way; when setuid programs are
involved, the potential consequences are worse.
There are a number of ways to avoid falling into this trap. An application
can, on startup, ensure that the first three file descriptors are open. Or
it can check the returned file descriptor from open() calls and
use dup() to change the descriptor if need be. But these options
are expensive, especially considering that, almost all of the time, the
standard file descriptors are set up just as they should be.
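For illustration, the second workaround (check the descriptor and duplicate it if need be) might look something like the following. The function name is hypothetical, and the sketch uses fcntl() with F_DUPFD rather than plain dup(), since F_DUPFD can ask for the lowest free descriptor at or above a given number. It does not handle O_CREAT, which would need a mode argument.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path, but never hand back descriptor 0, 1, or 2.
 * Returns the new descriptor, or -1 with errno set on failure.
 */
int open_above_std(const char *path, int flags)
{
    int fd = open(path, flags);

    if (fd < 0 || fd > 2)
        return fd;

    /* F_DUPFD returns the lowest free descriptor >= its third argument. */
    int safe = fcntl(fd, F_DUPFD, 3);
    int saved = errno;

    close(fd);          /* release the low-numbered descriptor */
    errno = saved;      /* report the fcntl() error, not close()'s */
    return safe;
}
```

The extra fcntl() and close() calls only happen in the rare case where a standard descriptor was closed, but the check itself must be made on every open, which is the cost a kernel-side flag would avoid.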
Eric Blake has proposed a new alternative in the form of the O_NOSTD flag. The
semantics are simple: if this flag is provided to an open() call,
the kernel will not return one of the "standard" file descriptors. If this
patch goes in (and there does not seem to be any opposition to that),
application developers will be able to use it to ensure that they are not
getting any file descriptor surprises without additional runtime cost.
There is a cost, of course, in the form of a non-standard flag that will
not be supported on all platforms. One could almost argue that it would be
better to add a specific flag for cases where a file descriptor in the
[0..2] range is desired. But that would be a major ABI change to say the
least; it's not an idea that would be well received.
Linux-ARM mailing lists. Russell King has announced that the
ARM-related mailing lists on arm.linux.org.uk will be shut down
immediately. He is, it seems, not happy about some of the criticism he has
received about the operation of those lists. So the lists will be moving,
though exactly where is not entirely clear. David Woodhouse has created a new set of lists on infradead; he
appears to have moved the subscriber lists over as well. There is also a
push to move the list traffic to vger, but
the preservation of the full set of lists and their subscribers suggests
that the infradead lists are the ones which will actually get used.
By Jonathan Corbet
August 25, 2009
An "address space" in kernel jargon is a mapping between a range of
addresses and their representation in an underlying filesystem or device.
There is an address space associated with every open file; any given
address space may or may not be tied to a virtual memory area in a
process's virtual (memory) address space. In a typical process, a number
of address spaces will exist for mappings of the executable being
run, files the process has open, and ranges of anonymous user memory (which
use swap as their backing store). There are a number of ways for processes
to operate on their address spaces, one of the stranger of which is
direct I/O. A new patch series from Jens Axboe looks to rationalize the
direct I/O path a bit, making it more flexible in the process.
The idea behind direct I/O is that data blocks move directly between the
storage device and user-space memory without going through the page cache.
Developers use direct I/O for either (or both) of two reasons:
(1) they believe they can manage caching of file contents better than
the kernel can, or (2) they want to avoid overflowing the page cache
with data which is unlikely to be of use in the near future. It is a
relatively little-used feature which is often combined with another obscure
kernel capability: asynchronous I/O. The biggest consumers, by far, of this
functionality are large relational database systems, so it is not entirely
surprising that a developer currently employed by Oracle is working in this
area.
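From user space, direct I/O is requested with the O_DIRECT flag to open(). A minimal sketch of what that looks like follows; the helper names and the 4096-byte alignment are assumptions made for illustration (the real alignment requirement depends on the device and filesystem), and, since O_DIRECT is not part of POSIX, its use is guarded.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment; the real requirement depends on device and filesystem. */
#define DIO_ALIGN 4096

/* Round len up to the next multiple of DIO_ALIGN. */
size_t dio_round_up(size_t len)
{
    return (len + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
}

/* Allocate a buffer whose address and size are both multiples of DIO_ALIGN. */
void *dio_alloc(size_t len)
{
    return aligned_alloc(DIO_ALIGN, dio_round_up(len));
}

/*
 * Read len bytes at offset off, bypassing the page cache where the platform
 * offers O_DIRECT.  buf, len, and off are expected to be DIO_ALIGN-aligned.
 * Returns the byte count from pread(), or -1 on error.
 */
ssize_t dio_read(const char *path, void *buf, size_t len, off_t off)
{
    int flags = O_RDONLY;
#ifdef O_DIRECT
    flags |= O_DIRECT;          /* not in POSIX, so guard its use */
#endif
    int fd = open(path, flags);

    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf, len, off);

    close(fd);
    return n;
}
```

Note that some filesystems reject O_DIRECT outright, so a real application would need a fallback to buffered I/O.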
When the kernel needs to do something with an address space, it usually
looks into the associated address_space_operations structure for
an appropriate function. So, for example, normal file I/O is handled
with:
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *filp, struct page *page);
As with the bulk of low-level, memory-oriented kernel operations, these
functions operate on page structures. When memory is managed at
this level, there is little need to worry about whether it is user-space or
kernel memory, or whether it is in the high-memory zone. It's all just
memory. The function which handles direct I/O looks a little different,
though:
ssize_t (*direct_IO)(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
The use of the kiocb structure shows the assumption that direct
I/O will be submitted through the asynchronous I/O path. Beyond that,
though, the iovec structure pointing to the buffers to be
transferred comes directly from user space, and it contains user-space
addresses. That, in turn, implies that the direct_IO() function
must itself deal with the process of getting access to the user-space
buffers. That task is generally handled in VFS-layer generic code, but
there's another problem: the direct_IO() function cannot be called
on kernel memory.
The kernel does not normally need to use the direct I/O paths itself, but
there is one exception: the loopback driver. This driver allows an
ordinary file to be mounted as if it were a block device; it can be most
useful for accessing filesystem images stored within disk files. But files
accessed via a loopback mount may well be represented in the page cache
twice: once on each side of the loopback mount. The result is a waste of
memory which could probably be put to better uses.
It would, in summary, be nice to change the direct_IO() interface
to avoid this memory waste, and to make it a little bit more consistent
with the other address space operations. That is what Jens's patch does. With that
patch, the interface becomes:
struct dio_args {
    int rw;
    struct page **pages;
    unsigned int first_page_off;
    unsigned long nr_segs;
    unsigned long length;
    loff_t offset;
    /*
     * Original user pointer, we'll get rid of this
     */
    unsigned long user_addr;
};
ssize_t (*direct_IO)(struct kiocb *iocb, struct dio_args *args);
In the new API, many of the relevant parameters have been grouped into the
dio_args structure. The memory to be transferred can be found by
way of the pages array. The higher-level VFS direct I/O code now
handles the task of mapping user-space buffers and creating the
pages array.
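The arithmetic involved in turning a user-space buffer into a page array can be sketched in user space. The code below is not taken from the patch; the structure and function names are hypothetical and a 4096-byte page is assumed. It shows how an offset into the first page (the role played by first_page_off) and the number of pages needed follow from a user address and a length.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size */

/* Hypothetical summary of how a user buffer maps onto pages. */
struct dio_span {
    unsigned int  first_page_off;  /* offset of the buffer in its first page */
    unsigned long nr_pages;        /* pages touched by the buffer */
};

/* Compute the page span covered by [user_addr, user_addr + length). */
struct dio_span dio_span_of(uintptr_t user_addr, size_t length)
{
    struct dio_span s;

    s.first_page_off = (unsigned int)(user_addr & (MODEL_PAGE_SIZE - 1));
    s.nr_pages = (s.first_page_off + length + MODEL_PAGE_SIZE - 1)
                 / MODEL_PAGE_SIZE;
    return s;
}
```

A 32-byte buffer starting 16 bytes before a page boundary, for example, straddles two pages even though it is far smaller than one.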
The impact on the code is, for the most part, small; it's mostly a matter
of moving the location where the translation from user-space address to
page structures is done. The current code does have a potential
problem in that it only processes one I/O segment at a time, possibly
creating performance problems for some kinds of applications. That mode of
operation is not really wired into the system, though, and can presumably
be fixed at some point.
The only other objection came from Andrew
Morton, who does not like the way Jens implemented the process of working
through the array of page structures. The index into this array
(called head_page) is built into struct dio and hidden
from the code which is actually working through the pages; that leads to
potential confusion, especially if the operation aborts partway through.
Andrew called it "a disaster waiting to happen" and
recommended that indexing be made explicit where the pages array
is processed.
That is a detail, though - albeit a potentially important one. The core
goals and implementation appear to have been received fairly well. It
seems highly unlikely that this code could be ready for the 2.6.32 merge
window, but we might see it aiming for the mainline in a subsequent
development cycle.
By Jonathan Corbet
August 26, 2009
The Linux Foundation recently announced the release of an updated version
of its kernel authorship report,
co-written by LWN editor Jonathan Corbet. The information there is
interesting, but, since it stops with the 2.6.30 kernel, it also is ancient
history at this point. 2.6.30 came out two full months ago, after all.
LWN readers, certainly, are used to more current information. Since
2.6.31 is getting close to ready, it seems like the right time to look at
this development cycle and see where the code came from.
As of this writing (just after the release of 2.6.31-rc7), the 2.6.31
development cycle had seen the incorporation of 10,663 non-merge changesets
from 1,146 individual developers. These patches added almost 903,000 lines
of code and removed just over 494,000 lines, for a net growth of just over
408,000 lines. According to Rafael Wysocki's August 25 report, this work
introduced 108 regressions into the kernel, 26 of which still lack a
resolution.
The largest individual contributors in the 2.6.31 development cycle were:
Most active 2.6.31 developers

By changesets
| Developer | Changesets | Share |
| Ingo Molnar | 276 | 2.6% |
| Peter Zijlstra | 260 | 2.4% |
| Paul Mundt | 204 | 1.9% |
| Takashi Iwai | 150 | 1.4% |
| Bartlomiej Zolnierkiewicz | 149 | 1.4% |
| Steven Rostedt | 139 | 1.3% |
| Tejun Heo | 134 | 1.3% |
| Johannes Berg | 133 | 1.2% |
| Magnus Damm | 119 | 1.1% |
| Mike Frysinger | 115 | 1.1% |
| roel kluin | 105 | 1.0% |
| Greg Kroah-Hartman | 101 | 0.9% |
| Erik Andrén | 100 | 0.9% |
| Paul Mackerras | 85 | 0.8% |
| Mark Brown | 85 | 0.8% |
| Bill Pemberton | 82 | 0.8% |
| Jaswinder Singh Rajput | 79 | 0.7% |
| Ben Dooks | 72 | 0.7% |
| Joe Perches | 72 | 0.7% |
| Alexander Beregalov | 71 | 0.7% |

By changed lines
| Developer | Lines changed | Share |
| Bartlomiej Zolnierkiewicz | 220749 | 18.3% |
| Jerry Chuang | 78441 | 6.5% |
| Forest Bond | 50834 | 4.2% |
| David Daney | 40052 | 3.3% |
| Jerome Glisse | 38604 | 3.2% |
| Vlad Zolotarov | 23260 | 1.9% |
| Ingo Molnar | 22614 | 1.9% |
| James Smart | 19209 | 1.6% |
| Bill Pemberton | 17249 | 1.4% |
| dmitry pervushin | 14532 | 1.2% |
| Greg Kroah-Hartman | 13234 | 1.1% |
| Wai Yew CHAY | 12741 | 1.1% |
| Michael Chan | 11887 | 1.0% |
| Linus Walleij | 11626 | 1.0% |
| Paul Mundt | 10735 | 0.9% |
| Peter Zijlstra | 10202 | 0.8% |
| Zhu Yi | 10197 | 0.8% |
| Ben Dooks | 10150 | 0.8% |
| Johannes Berg | 9532 | 0.8% |
| Kalle Valo | 9263 | 0.8% |
Ingo Molnar always shows up near the top of the changeset statistics. He
has, as usual, contributed work all over the core kernel and x86
architecture code, but the bulk of his work this time is in the performance
counters code; most of Peter Zijlstra's contributions were also in this
area. The merging of this fast-changing subsystem caused those two
developers to be responsible for 5% of the patches going into the 2.6.31
release. Paul Mundt wrote a vast number of Super-H architecture patches,
and Takashi Iwai contributed large numbers of ALSA patches.
#5 on the changesets list is Bartlomiej Zolnierkiewicz, who also comes out
on top in terms of the number of lines changed. He contributed a few IDE
patches, despite having handed off responsibility for that subsystem, but
most of his work went into the cleaning-up of Ralink wireless drivers in
the staging tree. This cleanup resulted in the removal of an amazing
208,000 lines of code. Jerry Chuang added the RealTek RTL8192SU wireless
driver (to staging), Forest Bond added the VIA Technologies VT6655 driver
(to staging), David Daney did a bunch of MIPS work (including adding the
Octeon Ethernet driver to the staging tree), and Jerome Glisse added kernel
mode setting support for Radeon graphics chipsets.
As we have seen in the past few development cycles, the staging tree
is the source of much of the change in the kernel tree. The nature of that
change is, itself, changing, though. The rush of adding out-of-tree
drivers to the staging tree has slowed considerably; we're starting to see
more work dedicated to fixing up the code which is already there.
The developers contributing to 2.6.31 were supported by a minimum of 194
employers. The most active of those were:
Most active 2.6.31 employers

By changesets
| Employer | Changesets | Share |
| (None) | 1704 | 16.0% |
| Red Hat | 1587 | 14.9% |
| Intel | 878 | 8.2% |
| (Unknown) | 846 | 7.9% |
| IBM | 667 | 6.3% |
| Novell | 614 | 5.8% |
| Renesas Technology | 345 | 3.2% |
| Fujitsu | 223 | 2.1% |
| (Consultant) | 212 | 2.0% |
| Analog Devices | 212 | 2.0% |
| Oracle | 175 | 1.6% |
| Nokia | 131 | 1.2% |
| AMD | 129 | 1.2% |
| Atheros Communications | 118 | 1.1% |
| MontaVista | 104 | 1.0% |
| Xelerated AB | 100 | 0.9% |
| (Academia) | 92 | 0.9% |
| NetApp | 91 | 0.9% |
| HP | 86 | 0.8% |
| Wolfson Microelectronics | 85 | 0.8% |

By lines changed
| Employer | Lines changed | Share |
| (None) | 311803 | 25.8% |
| Red Hat | 124831 | 10.3% |
| Realtek | 78441 | 6.5% |
| Intel | 62559 | 5.2% |
| Broadcom | 51806 | 4.3% |
| Logic Supply | 51401 | 4.3% |
| (Unknown) | 47165 | 3.9% |
| Cavium Networks | 40086 | 3.3% |
| IBM | 39991 | 3.3% |
| Novell | 31979 | 2.6% |
| Renesas Technology | 31674 | 2.6% |
| (Consultant) | 23659 | 2.0% |
| Emulex | 19209 | 1.6% |
| University of Virginia | 17607 | 1.5% |
| Nokia | 16234 | 1.3% |
| Embedded Alley Solutions | 15229 | 1.3% |
| Creative Technology | 12741 | 1.1% |
| Oracle | 11704 | 1.0% |
| Analog Devices | 10760 | 0.9% |
| Texas Instruments | 10639 | 0.9% |
The top group in either category is developers working on their own time,
followed by Red Hat, which merged a few large chunks of code this time
around.
A look at non-author signoffs (a hint as to which subsystem maintainers
admitted the patches into the mainline) shows a continuation of recent
trends:
Top non-author signoffs in 2.6.31

Individuals
| Developer | Signoffs | Share |
| David S. Miller | 964 | 10.1% |
| Ingo Molnar | 948 | 9.9% |
| Greg Kroah-Hartman | 582 | 6.1% |
| John W. Linville | 575 | 6.0% |
| Andrew Morton | 569 | 6.0% |
| Mauro Carvalho Chehab | 535 | 5.6% |
| Linus Torvalds | 254 | 2.7% |
| James Bottomley | 237 | 2.5% |
| Benny Halevy | 191 | 2.0% |
| Paul Mundt | 159 | 1.7% |

Employers
| Employer | Signoffs | Share |
| Red Hat | 3686 | 38.7% |
| Novell | 1061 | 11.1% |
| Intel | 829 | 8.7% |
| Google | 572 | 6.0% |
| (None) | 422 | 4.4% |
| IBM | 383 | 4.0% |
| Linux Foundation | 254 | 2.7% |
| Oracle | 228 | 2.4% |
| Panasas | 193 | 2.0% |
| (Consultant) | 168 | 1.8% |
49.8% of the patches going into the mainline for 2.6.31 passed through the
hands of developers working for just two companies: Red Hat and Novell.
Linux kernel developers work for a large number of companies, but subsystem
maintainers are increasingly concentrated in a very small number of places.
In summary, it is a fairly typical development cycle for the kernel in
recent times. The number of changes is high (but not a record), as is the
number of developers. The transient effect of the staging tree is
beginning to fade; it is becoming just another path for drivers heading
into the mainline. As a whole, the process seems to be functioning in a
smooth and robust manner.
(As always, your editor would like to thank Greg Kroah-Hartman for his
assistance in the preparation of these statistics.)
August 26, 2009
This article was contributed by Jon Ashburn
One downside to the ever-increasing memory size available on computers
is an increase in memory failures. As memory density increases, error
rates also rise. To offset this increased error rate, recent processors
have included support for "poisoned" memory, an adaptive method for
flagging and recovering from memory errors. The HWPOISON patch recently
developed by Andi Kleen and Fengguang Wu provides Linux kernel
support for memory poisoning. Thus, when HWPOISON is coupled with the
appropriate fault-tolerant processors, Linux users can enjoy systems that
are more tolerant to memory errors in spite of increased memory
densities.
Memory errors are classified as either soft (transient) or hard
(permanent). In soft errors, cosmic rays or random errors can toggle the
state of a bit in an SRAM or DRAM memory cell. In hard errors, memory cells
become physically degraded. Hardware can detect, and automatically correct,
some of these errors via Error Correcting Codes (ECC). While single-bit
data errors can be corrected via ECC, multi-bit data errors cannot. For
these uncorrectable errors, the hardware typically generates a trap which,
in turn, causes a kernel panic.
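The difference between correctable and uncorrectable errors can be seen with a toy Hamming(7,4) code, sketched below. This is an illustration only: real ECC memory uses wider codes implemented in hardware, and the function names here are made up for the example. The code stores four data bits in a seven-bit word; any single flipped bit can be located and repaired, while two flipped bits lead the decoder to "repair" the wrong bit.

```c
#include <stdint.h>

/* Codeword positions 1..7 map to bits 0..6; data lives at 3, 5, 6, 7. */
static const unsigned data_pos[4] = { 3, 5, 6, 7 };

/* XOR of the positions of all set bits; zero for a valid codeword. */
static unsigned syndrome(uint8_t cw)
{
    unsigned s = 0;

    for (unsigned pos = 1; pos <= 7; pos++)
        if (cw & (1u << (pos - 1)))
            s ^= pos;
    return s;
}

/* Encode the low four bits of nibble into a 7-bit codeword. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t cw = 0;

    for (int i = 0; i < 4; i++)
        if (nibble & (1u << i))
            cw |= (uint8_t)(1u << (data_pos[i] - 1));

    /* Set parity bits (positions 1, 2, 4) so the syndrome becomes zero. */
    unsigned s = syndrome(cw);

    for (unsigned p = 1; p <= 4; p <<= 1)
        if (s & p)
            cw |= (uint8_t)(1u << (p - 1));
    return cw;
}

/* Decode a codeword, repairing at most one flipped bit. */
uint8_t hamming74_decode(uint8_t cw)
{
    unsigned s = syndrome(cw);
    uint8_t nibble = 0;

    if (s)                              /* syndrome names the bad position */
        cw ^= (uint8_t)(1u << (s - 1));

    for (int i = 0; i < 4; i++)
        if (cw & (1u << (data_pos[i] - 1)))
            nibble |= (uint8_t)(1u << i);
    return nibble;
}
```

Production codes usually add a further parity bit so that double-bit errors can at least be detected rather than silently miscorrected; those detected-but-uncorrectable errors are the ones that lead to the trap described above.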
The blanket action of crashing the machine for all uncorrected soft and
hard memory errors is sometimes an overreaction. If the detected memory
error never actually corrupts executing software, then ignoring or
isolating the error is the most desirable action. Memory "poisoning", with
its delayed handling of errors, allows for a more graceful recovery from
and isolation of uncorrected memory errors rather than just crashing the
system. However, memory poisoning requires both hardware and kernel
support.
The HWPOISON patch is very
timely: Intel's recent preview of its Xeon
processor (codenamed Nehalem-EX) promises support for memory poisoning.
Intel has included its Machine Check Abort (MCA) Recovery
architecture in Nehalem-EX. Originally developed for ia64
processors, Intel's MCA Recovery architecture supports memory poisoning
and various other hardware failure recovery mechanisms. While HWPOISON
adopted Intel's usage of the term "poisoning", this should not be confused
with the unrelated Linux kernel concept of poisoning: writing a pattern to
memory to catch uninitialized memory.
While the specifics of how hardware and the kernel might implement
memory poisoning vary, the general concept is as follows. First,
hardware detects an uncorrectable error from memory transfers into the system
cache or on the system bus. Alternatively, memory may be occasionally
"scrubbed." That is, a background process may initiate an ECC check on one or
more memory pages. In either case, the hardware doesn't immediately cause
a machine check but rather flags the data unit as poisoned until read (or
consumed). Later, when erroneous data is read by executing software, a
machine check is initiated. If the erroneous data is never read, no
machine check is necessary. For example, a modified cache line written
back to main memory may have a data word error that is marked as poisoned.
Once the poisoned data is actually used (loaded into a processor register,
etc.), a machine check occurs, but not before. Thus, any poisoning machine
check event may happen long after the corresponding data error event.
HWPOISON is a poisoned data handler invoked by the low-level Linux
machine check code. Where possible, HWPOISON attempts to gracefully
recover from memory errors, and contain faulty hardware to prevent future
errors. At first glance, an obvious solution for the poison handler would
focus on the specific process and memory address(es) associated with the
data error. However, this is infeasible for two reasons. First, the
offending instruction and process cannot be determined due to delays
between the data error consumption and execution of the poison handler.
These delays include asynchronous hardware reporting of the machine check
event, and delayed execution of the handler via a workqueue. Thus, a
different process may be executing by the time the HWPOISON handler is
ready to act. Second, bad-memory containment must be done at a level
where the kernel actually manages memory. Thus, HWPOISON focuses on memory
containment at the page granularity rather than the finer granularity
supported by Intel's MCA Recovery hardware.
HWPOISON finds the page containing the poisoned
data and attempts to isolate this page from further use. Potentially
corrupted processes can then be located by finding all processes that have
the corrupted page mapped. HWPOISON performs a variety of different
actions. Its exact behavior depends upon the type of corrupted page and
various kernel configuration parameters.
To enable the HWPOISON handler, the kernel configuration parameter
MEMORY_FAILURE must be set. Otherwise, hardware poisoning will cause a
system panic. Additionally, the architecture must support data poisoning.
As of this writing, HWPOISON is enabled for all architectures to make
testing on any machine possible via a user-mode fault injector, which is
detailed below.
The handler must allow for multiple poisoning events occurring in a
short time window. HWPOISON uses a bit in the flags field of a
struct page to mark and lock a page as poisoned. Since page flags
are currently in short supply, this choice was not made without
consternation and debate by kernel hackers. See this LWN article for further
details about this issue. In any case, this bit allows previously poisoned
pages to be ignored by the handler.
The handler ignores the following types of pages: 1) pages that have
been previously poisoned, 2) pages that are outside of kernel control (an
invalid page frame number), 3) reserved kernel pages, and 4) pages with usage count of
zero, which implies either a free or higher order kernel page. The
poisoned bit in the flags field serves as a lock allowing rapid-fire
poisoning machine checks on the same page to be handled only once by
ignoring subsequent calls to the handler. Reserved kernel pages and
zero-count pages are ignored at the risk of a system panic: such pages may
contain critical kernel data and cannot be isolated, so HWPOISON has
no useful options for recovery.
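The way a single flag bit can serve as both a marker and a lock can be shown with an atomic test-and-set. The sketch below uses C11 atomics in user space; the bit position and the function name are assumptions made for the example, not the kernel's actual page-flag layout or API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit position standing in for the poison page flag. */
#define MODEL_PG_HWPOISON (1UL << 0)

/*
 * Atomically mark a page's flags word as poisoned.  Returns true only for
 * the caller that actually set the bit; later callers see it already set
 * and can return without handling the page a second time.
 */
bool page_claim_poison(atomic_ulong *flags)
{
    unsigned long old = atomic_fetch_or(flags, MODEL_PG_HWPOISON);

    return !(old & MODEL_PG_HWPOISON);
}
```

Because the read and the write happen as one atomic operation, two machine checks arriving for the same page at nearly the same time cannot both conclude that they were first.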
In addition to ignoring pages, possible HWPOISON actions include
recovery, delay, and failure. Recovery means HWPOISON took action to
isolate a page. Ignore, failure, and delay are all similar in that the
page was not completely isolated, except for flagging the page as poisoned.
With delay, handling can be safely postponed until a later time when the
page might be referenced. By delaying, some transient errors may not
recur or may be irrelevant. HWPOISON delays any action on kernel slab or
buddy allocator pages or free pages. With failure, HWPOISON could in
principle handle the page, but does not yet support doing so. HWPOISON
takes an action of failure
on unknown or huge pages. Huge pages fail since reverse mapping is not
supported to identify the process which owns the page.
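The decision process described in the last few paragraphs can be summarized as a small classification function. This is a simplified user-space model, not the handler's code: the structure, its field names, and the order of the checks are assumptions, and slab pages stand in for all of the delayed page types.

```c
#include <stdbool.h>

/* The four outcomes described in the text. */
enum hwp_action { HWP_IGNORED, HWP_DELAYED, HWP_FAILED, HWP_RECOVERED };

/* Hypothetical, much-reduced description of a page. */
struct model_page {
    bool poisoned;      /* already flagged by an earlier event */
    bool valid_pfn;     /* page frame is under kernel control */
    bool reserved;      /* reserved kernel page */
    int  count;         /* usage count */
    bool slab;          /* kernel slab page (delayed) */
    bool huge;          /* huge page (not supported) */
    bool known_type;    /* page type the handler knows how to isolate */
};

/* Pick an action for a poisoned page, checking the cheap exits first. */
enum hwp_action hwp_classify(const struct model_page *p)
{
    if (p->poisoned || !p->valid_pfn || p->reserved || p->count == 0)
        return HWP_IGNORED;
    if (p->slab)
        return HWP_DELAYED;
    if (p->huge || !p->known_type)
        return HWP_FAILED;
    return HWP_RECOVERED;   /* e.g. page cache or user-mapped pages */
}
```

Only the last outcome involves real isolation work; how that work is done for each page type is described below.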
Clean pages in either the swap or page cache can be easily recovered by
invalidating the cache entry for these pages. Since these pages have a
duplicate backing copy on disk, the in-memory cache copy can be
invalidated. Unlike clean pages, dirty pages in these caches have
differences between the memory and disk copies. Thus, poisoned dirty
pages may have important data corruption. However, dirty pages in the
page cache are recovered by invalidation of the cache. Additionally, a page
error is set for the dirty page cache page so subsequent user system calls
on the file associated with the page will return an I/O error. Dirty pages
in the swap cache are handled in a delayed fashion. The dirty flag is
cleared for the page and the page swap cache entry is maintained. On a
later page fault the associated application will be killed.
To recover from poisoned, user-mapped pages, HWPOISON first finds all
user processes which mapped the corrupted page. For clean pages with
backing store, HWPOISON need not take recovery action since the process
does not need to be killed. Dirty pages are unmapped from all associated
processes, which are subsequently killed. Two VM sysctl
parameters are supported by HWPOISON with respect to killing user
processes: vm.memory_failure_early_kill and
vm.memory_failure_recovery. Setting the
vm.memory_failure_early_kill parameter causes an immediate SIGBUS
to be sent to the user process(es). The kill is done using a catchable
SIGBUS with BUS_MCEERR_AO. Thus, processes can decide how they want to
handle the data poisoning. The vm.memory_failure_recovery
parameter delays the killing: the page is merely unmapped by HWPOISON.
When this unmapped page is actually referenced at a later time, a
SIGBUS will be sent.
An HWPOISON patch git repository is available at
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison
Since faulty hardware that supports data poisoning is not easy to come by,
a fault injection test harness mm/hwpoison-inject.c has also been
developed. This simple harness uses debugfs to allow failures at an
arbitrary page to be injected.
While HWPOISON was developed for x86-based machines, interest has been
expressed by supporters of other Linux server architectures, such as ia64
and sparc (discussed
here). Thus, the patch may proliferate on future Linux server
distributions, allowing users of future Linux servers to enjoy increased
fault tolerance. Now that Intel is supporting MCA Recovery on x86 machines,
some desktop users may also enjoy its benefits in the near future.
Page editor: Jonathan Corbet