[Editor's note: our recent coverage from the 2009 Linux Storage and
Filesystem Workshop (day 1, day 2) contained no notes
from the storage track - an unfortunate result of your editor's inability
to be in two places at the same time. Happily, James Bottomley took good
notes, which he has now made available to us to publish. Many thanks to
James for providing these notes to us.]
Storage Track: Day 1
Multipathing
The day began with a review of request-based multipathing. The problem
it is trying to solve is that the device mapper implements
multipathing well above the block layer, and thus has great difficulty
getting accurate data from, and control of, the underlying devices,
which sit well below the block subsystem. The agreed solution is to
implement a request-based layer for multipathing which can sit
underneath the block layer and provide a bidirectional conduit to the
device mapper above it, so that status and event processing of
multipath-related information can occur much closer to where the
actual events are generated.
Hannes Reinecke presented the current state of multipath which, from a
SCSI and block point of view, is generally good: all of the necessary
SCSI and block components for request-based multipathing are now
upstream. We also have device-specific multipath plugins which
handle the vagaries of the multipath implementation in certain well-known
storage arrays and ensure that path failure and path switchover
occur much more expediently. The remaining code changes, for the
device mapper layer, are sitting in Alasdair Kergon's tree but
aren't yet upstream.
The first of the two remaining problems discussed was fast fail for
the path checker daemon. Device mapper uses a daemon to ping each path
periodically to make sure it is alive. Unfortunately, if something bad
happens and error handling takes over, it can be many seconds before
the ping returns. Several ways of handling this were discussed, and
it was agreed to implement some type of uevent that is sent when error
handling is invoked on the device, thus informing the path checker
that something is wrong.
Finally, the problems with last path disconnection were debated.
There are two competing tensions here: the first is that the device
mapper queue_if_no_path setting, which most people use, has the
problem that if all paths are failed it will hang on to outstanding
I/O forever until a path returns. Worse, if the device is backing a
filesystem, the normal I/O writeback path will be filling the system up
with dirty pages that cannot be cleaned because writeout is stuck.
Eventually, the system can reach a point where, even if a path returns,
we don't have enough memory left to do all the operations necessary to
reattach the path, which means the system is live-locked on dirty
memory and never recovers. After a detour into mempool
implementations, it was agreed that, given device mapper's reliance on a
user-space daemon for path attachment, it was impossible to protect all
allocations such that the user space daemon can do its job. Thus, the
final agreed solution was to add a separately tunable timeout to the
queue_if_no_path case to set a wait time beyond which device mapper
will error out all I/O in the queue (thus freeing the dirty memory)
and allow the system to recover as best it can.
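The agreed behaviour can be illustrated with a small user-space model.
This is only a sketch under stated assumptions, not device mapper code:
the class NoPathQueue, its methods and the no_path_timeout parameter are
hypothetical names invented for the illustration, and time is passed in
explicitly rather than read from a clock.

```python
from collections import deque


class NoPathQueue:
    """Toy model of queue_if_no_path with a (hypothetical) timeout."""

    def __init__(self, no_path_timeout):
        self.no_path_timeout = no_path_timeout  # seconds; hypothetical tunable
        self.queued = deque()      # I/O held while no path is available
        self.completed = []        # (request, status) pairs
        self.paths_up = 0
        self.no_path_since = None  # time at which the last path failed

    def path_restored(self):
        self.paths_up += 1
        self.no_path_since = None
        # A path is back: dispatch everything that was being held.
        while self.queued:
            self.completed.append((self.queued.popleft(), "ok"))

    def path_failed(self, now):
        self.paths_up = max(0, self.paths_up - 1)
        if self.paths_up == 0 and self.no_path_since is None:
            self.no_path_since = now

    def submit(self, request):
        if self.paths_up > 0:
            self.completed.append((request, "ok"))
        else:
            self.queued.append(request)  # queue_if_no_path behaviour

    def tick(self, now):
        """Error out all held I/O once the no-path timeout has expired."""
        if self.no_path_since is None:
            return
        if now - self.no_path_since >= self.no_path_timeout:
            while self.queued:
                self.completed.append((self.queued.popleft(), "error"))
```

Without the tick() step the queue would hold its requests forever, which
is the hang described above; with it, the held requests are failed and
whatever memory they pin can be released.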
SCSI Target Implementations
Unfortunately, I had a conflicting meeting, so there's no summary of
this one.
ATA Issues
Tejun Heo led off with a request for help on a particularly knotty
problem: enabling zero-length barriers for ATA devices has led some
of them to power off briefly when a barrier flush is issued.
Unfortunately, this shows up only as a brief glitch on the SATA bus,
yet it causes the entire drive cache to be trashed, leading to the loss
of data. The greatest problem here is detecting when drives actually do
this, because the event is indistinguishable from a bus fluctuation.
Several root causes were discussed, including the possibility that the
problem is created by power saving within multiple drives causing them
to power up briefly to flush the cache, overwhelming the power supply
and thus losing the data. In the end it was agreed that the best way of
detecting this situation is from user space, using the SMART power
cycle counter. (The reason for using a user-space daemon is that
there's no standard way of getting at the drive SMART data for ATA
drives, although SCSI does have a standard mode page.)
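As a rough illustration of the detection idea, a user-space daemon could
sample the power cycle counter at intervals and flag any increase that
occurs while the host itself has stayed powered on. The sketch below
assumes the readings have already been obtained somehow (how to fetch
them is deliberately left out), and the function name is hypothetical.

```python
def unexpected_power_cycles(samples):
    """Return (index, delta) for each sample where the counter rose.

    `samples` is a list of power cycle counter readings taken at
    intervals while the host stayed powered on, so any increase
    suggests that the drive power-cycled on its own.
    """
    events = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta > 0:
            events.append((i, delta))
    return events
```

For example, the readings [41, 41, 42, 42, 44] would flag the third and
fifth samples, with increases of one and two cycles respectively.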
Discussion then moved on to the current status of getting libata out
of SCSI: we have had several successes, notably timer handling and
pieces of error handling have moved up to block. Unfortunately,
progress has reached the point where it is being impeded by the
legacy IDE subsystem, which still relies on some very old fields
and undocumented behaviour of the block layer; this matters because
the next step is to simplify the block to low-level interface and move
to a more exact and well-understood API. No solutions were proposed,
but Tejun will continue trying to clean up both block and drivers/ide
in parallel to achieve this.
Storage Track: Day 2
I/O Scheduling and Tracing
This was a joint session presented by Fernando Cao, Vivek Goyal and Nauman Rafique.
Following feedback from the summit last year that an approach which
worked only for the CFQ scheduler was unacceptable, the group had
moved on to consider a layered approach that would provide a
throughput-aware scheduling service that worked for all services (it
was pointed out that the name for this, "I/O Controller", unfortunately
caused most practitioners to think immediately of I/O cards, and thus
might need changing).
The problem with the I/O Controller approach, which NTT voiced
through Fernando, is that, in an advanced storage topology, it is very
hard to ensure consistent throughput to virtual machines which sit well
above the I/O controller (which only sees the physical devices, not
even the low layer LVM or virtual devices). After several rounds of
discussion, it was agreed that this was a problem, but that the
correct place for the solution was still at the I/O scheduler level.
However, we would also run a lightweight daemon closer to the virtual
machines, very similar to irqbalanced, whose job is to keep an eye on
the virtual machine I/O throughput and adjust the low level I/O
schedulers periodically to achieve fairness to the virtual machines
(and ensure they achieve their targeted I/O bandwidth).
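One way to picture what such a daemon might do on each pass is a simple
proportional adjustment: compare each virtual machine's measured
throughput with its target and scale a per-VM scheduler weight
accordingly. This is a sketch only; the notion of a numeric per-VM
weight, its range, and the function name rebalance are assumptions made
for the illustration, not details from the session.

```python
def rebalance(weights, measured, targets, min_weight=1, max_weight=1000):
    """Return new per-VM weights nudged toward the target bandwidth.

    weights:  current (hypothetical) scheduler weight per VM
    measured: observed throughput per VM over the last interval
    targets:  desired throughput per VM, in the same units
    """
    new_weights = {}
    for vm, weight in weights.items():
        got = measured.get(vm, 0.0)
        if got <= 0:
            # An idle VM gives no information; leave its weight alone.
            new_weights[vm] = weight
            continue
        scaled = round(weight * targets[vm] / got)
        new_weights[vm] = max(min_weight, min(max_weight, scaled))
    return new_weights
```

A real daemon would presumably damp these adjustments to avoid
oscillation; the sketch applies the full correction in one step to keep
the idea visible.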
The last problem discussed was that of correctly accounting for I/O operations
as they passed through the I/O scheduler. I/O is accounted through an
I/O context which is attached to the bio as it passes through the
scheduler (meaning it is accounted to whatever process happens to be
current at the time). Unfortunately, this means that a lot of I/O is
incorrectly accounted to either the pdflush thread, or to random
processes that happen to be current when the I/O ended up being
submitted. In order to do the accounting correctly, some type of per-page
pointer to the io_context needs to be added. The feedback from
last year leaned very strongly toward the thought that expanding
struct page, which is critical-path infrastructure, would not be
appreciated, so Fernando's proposal was to have a second array, indexed by
page frame number, in which the I/O context could be stored. This
method would be tunable by a config option, but would occupy about 1KB
per megabyte of memory at boot time. After some discussion of the
memory cost, it was agreed that this probably represented about the
only way forward, but that the secondary struct page mechanism should
be made available for other processes that might need to attach
pointers to struct page but which also didn't need to be in the
critical path.
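The memory figure can be checked with a little arithmetic. The sketch
below assumes 4KB pages and one pointer-sized slot per page frame; both
values, and the function name, are assumptions for the illustration.
With 4-byte pointers the cost works out to 1KB per megabyte, matching
the figure quoted above; with 8-byte pointers it would be double that.

```python
def io_context_array_bytes(mem_bytes, page_size=4096, pointer_size=4):
    """Size of a side array holding one pointer per page frame."""
    page_frames = mem_bytes // page_size
    return page_frames * pointer_size


# One megabyte of memory is 256 page frames of 4KB each.
per_mb_32bit = io_context_array_bytes(1 << 20)                  # 1024 bytes
per_mb_64bit = io_context_array_bytes(1 << 20, pointer_size=8)  # 2048 bytes
```

The lookup itself is then just an index by page frame number into that
array, which is why struct page does not need to grow.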
Finally, the entire group took an action to send their current patch
queue to both the Linux kernel mailing list and the device mapper
list.
Fibre Channel Transports
James Smart reviewed the current status of the much maligned SNIA HBA
API for Fibre Channel. When originally promulgated, this standard
mandated a large and ugly ioctl() layer sitting in all drivers and was
roundly rejected by the kernel developers. Since that time, most of
the functionality required by the HBAAPI has been implemented in the
fibre channel transport class as sysfs attributes. The only remaining
implementation piece is the ability to transport arbitrary FCP (Fibre
Channel Protocol) frames onto the SAN. Version five of the patch
to do this (similar to the way SAS does expander frame communication
based on the Linux standard SG_IO ioctl) is currently circulating on
the SCSI mailing list and is expected to be merged in 2.6.31. The
mechanism is slightly more complex than the SAS API since the frame
command is defined by the implementations, not by FCP, with the
response data communicated in the sense buffer and the response frame
in the bidirectional in data.
The last remaining problem is that of receiving asynchronous events
from the fibre. After a bit of discussion, it was agreed that using
SG_IO to queue receive frames and some type of netlink message or uevent to
signal their arrival was probably an optimal solution for this. James
agreed to return to the mailing list with patches implementing this for
review. With this last piece, equivalent functionality to the HBAAPI
(with even one to one mapping in user space) should now be available
on Linux.
RAID Unification
Christoph Hellwig reported that the majority of the discussion had
been covered in the original plenary session and that not much
remained to be covered in the storage specific session.
Dan Williams took us through an implementation he was working on that
provided for sysfs reporting and configuration of md RAID volumes.
James Bottomley noted that this was very similar to the way the
original RAID transport class had been implemented. Dan agreed to
look at this to determine whether he could provide a generic mechanism
that would initially offer a unified view of RAID volumes, identical
across the various hardware and software RAID solutions, and which
might eventually provide us with a unified configuration interface
for each of them.