LWN readers will have seen our reporting from the Linux Storage and
Filesystem Summit (day 1, day 2), held on
August 8 and 9 in Boston. Your editor was unable to attend the
storage-specific sessions, though, so they were not covered in those
articles. Fortunately, James Bottomley took detailed notes, which he has
now made available to us. Many thanks to James for all of what follows.
BSG and iSCSI - Mike Christie; FUJITA Tomonori
The main iSCSI discussion centred around offloads, TCP offload engines
(ToE) and flow steering. The view expressed is that ToE is only acceptable
if the card processes only iSCSI and doesn't need to integrate with the
network stack; otherwise, it needs to do segment offloading and flow
steering, although it may process full iSCSI further up the stack.
For BSG, the main discussion centred around vendor specific commands
which can now be transmitted through the interfaces. James Bottomley
would be very unhappy if common functions ended up becoming a mosaic
set of vendor specific commands. The question was asked whether a
common API should be implemented (via a tool) in userspace which then translated to the
myriad vendor specific commands. After some discussion, it was agreed
that this wasn't a good way forwards, and achieving commonality
would be advanced better by standardising the actual BSG commands
directly via the Linux API.
SAN Management - Jacob Cherian; Joel Becker
Jacob went over the basics that vendors are looking for, most of which centres
around virtual machine provisioning using pieces of the HBAAPI.
Joel
then presented a tool which Oracle has been working on to do exactly
this. Written in python, it should allow the necessary pluggable
modules for individual arrays to make this functionality universally
available. Initial feedback was that python is less than desirable
for this, since ultimately there are many tools that need to use this
functionality, so something lower level like a C library would be
better. Unfortunately, Joel announced that, while this work would
eventually be available in open source (no time frame could currently
be given), Oracle was working under NDA with some vendors and was
unable to release source code at this time. Considerable annoyance
was expressed by all others present about this, particularly around the
fact that this is exactly the wrong way to work with the community
since Oracle will be presenting a fait accompli with no chance for
input. Representatives of the distributions were especially incensed
since they already have people working on aspects of this, and will
likely be duplicating effort.
Error Handling - Hannes Reinecke
The first topic up for discussion was Unit Attention (UA), which is
the SCSI way of signalling the fact that some event occurred
asynchronously. Particular UAs of interest were target
reconfiguration ones, but there was general agreement that the SCSI
stack should have a UA handling infrastructure that took the
information all the way up to user space.
Next up was error processing, which SCSI still does a poor job of,
especially with regard to
translating the error codes for upward consumption (in dm and fs).
The first proposal, which is easy to implement, is to copy the SCSI
result directly to the request result unconditionally, and to copy up
the sense information if the request has a sense buffer. However, in
the long run, the device mapper doesn't really want to be responsible for doing this
translation. The old idea of having SCSI send up an additional
classification along with the code was raised again, using a couple of
axes: Fatal vs Retryable, and "error occurred in Device, Transport or
Host." This would allow the
DM software to take more intelligent decisions about path switching
(for device errors, there's usually no point switching because the
device will give the same error). There was some discussion of
whether this translation could be achieved using the device mapper
SCSI handlers, after some discussion on the point, it was agreed that
the answer wouldn't be known until the code was produced.
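The two-axis classification described above could be sketched roughly as follows. This is only an illustration: the names `Severity`, `Origin` and `choose_action` are hypothetical and do not correspond to any existing kernel interface, and the policy is one plausible reading of the discussion (no path switch for device errors, since the device would return the same error on any path).

```python
from enum import Enum


class Severity(Enum):
    FATAL = "fatal"
    RETRYABLE = "retryable"


class Origin(Enum):
    DEVICE = "device"
    TRANSPORT = "transport"
    HOST = "host"


def choose_action(severity: Severity, origin: Origin) -> str:
    """Hypothetical multipath policy built on the two classification axes."""
    if origin is Origin.DEVICE:
        # Another path reaches the same device and would see the same error,
        # so switching paths cannot help; retry in place or give up.
        return "retry" if severity is Severity.RETRYABLE else "fail"
    # Transport and host errors belong to the path that was used, so trying
    # a different path is worthwhile (an assumption of this sketch).
    return "switch_path"
```

With this policy, `choose_action(Severity.FATAL, Origin.TRANSPORT)` asks for a path switch, while a fatal device error is failed straight up the stack.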
FC Sysfs and FCoE - James Smart; Robert Love
This session overran badly and was awarded another slot on Day 2 where
it will be covered in detail.
Thin Provisioning - Mark Ruder; Jacob Cherian
Mark and Jacob presented the position from the point of view of their
arrays (EqualLogic) where thin provisioning is used to claim the array
has more storage space than it actually has, allowing users to buy
more as they approach the limits. The arrays need to be told when
space that was written to is no longer in use, but this is nicely handled by
the current discard code, being translated to unmap or write_same in
the SCSI layer. One of the problems arrays have is that their mapping
blocks are quite large (usually 500KB to 1MB), so any trim the kernel
sends down needs to be aligned to them (anything smaller will just be
ignored). This makes the idea of accumulating discards important to
these arrays. There was a general discussion about whether online
discard (vs offline run periodically from cron) was more useful to
these arrays. The general consensus was that online discard was
important, but that running offline every few days to pick out
segments that might have been missed due to alignment issues would
also be useful.
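The alignment issue can be illustrated with a small sketch. It assumes byte offsets and a hypothetical helper, `usable_discard`, that models how an array with a given mapping-block size would treat a discard: only whole mapping blocks inside the range are reclaimed, and anything else is ignored. It is not taken from any array firmware or from the kernel.

```python
MIB = 1024 * 1024


def usable_discard(start: int, length: int, granularity: int):
    """Return the (start, length) portion covering whole mapping blocks, or None."""
    end = start + length
    aligned_start = -(-start // granularity) * granularity  # round start up
    aligned_end = (end // granularity) * granularity         # round end down
    if aligned_end <= aligned_start:
        return None  # too small or misaligned: the array reclaims nothing
    return aligned_start, aligned_end - aligned_start
```

A single 512KB discard at offset 0 yields nothing with a 1MB mapping block, but two adjacent 512KB discards accumulated into one 1MB request free a full block. That is the argument for accumulating discards, and for an occasional offline pass to catch the pieces that online discard missed.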
The next issue was about space availability warnings (the arrays
produce unit attention conditions when they run short of physical
space). General consensus was that we could print out the condition
for the sysadmin to rectify, but the filesystems themselves couldn't
really do anything with the knowledge.
Trim and Discard - Martin Petersen
Martin presented the current situation around trim and discard. In
SCSI, the problem is that array manufacturers have settled on two
representations: UNMAP, which can take multiple ranges, and WRITE SAME,
which only takes a single range. Since they're both optional, the
array is allowed to pick which one it supports. Fortunately, T10
finally came through with a way for us to tell which one the array
wants, so the fallback problem is now largely solved.
Unfortunately,
the current problem with ATA devices is that although the TRIM command
takes ranges, they're limited to a maximum size of 64k blocks (or 32MB
at a time on 512 byte sector devices). Therefore, there's a
considerable difficulty in large trims such as the ones done for
gigabyte file deletion or mkfs which trims the entire partition.
Attention also turned to the behaviour of trim on SATA devices
supporting NCQ: since trim isn't a queued command, sending one means
waiting for all queued commands to complete, issuing the trim on its
own, and then allowing the tags to build back up. This causes
a pause in the disk writes which can damage performance. Matthew
Wilcox said the non-queued problem had been pointed out to T13 (the
standards body governing ATA) and a queued version might be in the
works. No solution is currently proposed for the limited range
problem.
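To give a feel for the limited range problem, here is a minimal sketch of splitting one large trim into ranges. It assumes each range carries a 16-bit sector count, so at most 65535 sectors (just under 32MB with 512 byte sectors); the function name `split_trim` is hypothetical and this is not how libata is actually structured.

```python
MAX_SECTORS_PER_RANGE = 0xFFFF  # assumed 16-bit sector count per range
SECTOR_SIZE = 512


def split_trim(start_lba: int, nr_sectors: int,
               max_len: int = MAX_SECTORS_PER_RANGE):
    """Split one large trim into (lba, count) ranges that each fit the limit."""
    ranges = []
    while nr_sectors > 0:
        chunk = min(nr_sectors, max_len)
        ranges.append((start_lba, chunk))
        start_lba += chunk
        nr_sectors -= chunk
    return ranges


# Trimming 1GiB starting at LBA 0 on a 512 byte sector device.
one_gib = split_trim(0, (1 << 30) // SECTOR_SIZE)
```

Under these assumptions a 1GiB trim needs 33 ranges, and trimming a whole partition at mkfs time needs correspondingly more.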
SSD continuation - Eric Seppanen; Matthew Wilcox
After the conclusion of the plenary session on SSD, the I/O track
retired to deal with I/O specific issues. First up was the request
queue. Eric noted that the current crop of PCIe drivers were trying
to do without the request queue entirely. This was largely thought to
be a bad idea, since it removes the device from the current priority-
and cgroup-based bandwidth QoS updates which are looking to be
necessary, at least in virtual environments. Discussion then passed on
to reducing the stack latency in the block layer by reducing the number of
allocations made, particularly around sense data handling; old ideas
about a fixed number of rotating sense buffers were given an airing as
was having a rotating queue of ready requests and SCSI commands. The
basic advice was to move to using a request queue and we'd see if we
could reduce the latency.
Discussion next turned to MSI-X and multiqueue interrupt steering, the
idea being to keep data cache hot by returning it to the same CPU
issuing the request. Eric noted that really it doesn't have to be the
same CPU, just the same non-L1 cache (so same package but not
necessarily same core). Jens noted that the block layer kept this
information today in req->cpu but that it was up to the driver to use
this information to make the data enter the correct queue and the
MSI-X to return on the right CPU. It was noted that we can use the
current IRQ affinity interface to bind the MSI-X interrupts on a
per-cpu basis, but that there was no generic interface to set them up
correctly (so perhaps there should be).
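Eric's point that only the shared cache matters suggests one queue per package rather than one per CPU. The sketch below shows that mapping under stated assumptions: the CPU-to-package topology is supplied by the caller, and `build_queue_map` is a hypothetical helper, not an existing block layer or driver interface.

```python
from typing import Dict


def build_queue_map(cpu_to_package: Dict[int, int]) -> Dict[int, int]:
    """Map each CPU to a queue index, with one queue per package."""
    packages = sorted(set(cpu_to_package.values()))
    queue_of_package = {pkg: idx for idx, pkg in enumerate(packages)}
    return {cpu: queue_of_package[pkg] for cpu, pkg in cpu_to_package.items()}


# Hypothetical two-package, four-CPU machine.
queue_map = build_queue_map({0: 0, 1: 0, 2: 1, 3: 1})
```

A driver following this scheme would submit a request on the queue chosen for the issuing CPU and bind that queue's MSI-X vector to any CPU in the same package, so that the completion lands in a cache the submitter shares.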
Finally, developers noted that the best way to get them thinking about
the problems was to give them hardware (hint). Matthew Wilcox noted
that even the silicon vendors didn't really have the latest hardware
and that his current NVMHCI rig consisted of two x86 boxes joined by a
PCIe bus (one to act as host and one to act as device).
CFQ Performance - Jens Axboe; Vivek Goyal
Vivek began by presenting a series of slides showing performance
graphs where CFQ was outperformed by either deadline or noop
scheduling on a set of plausible workloads. The basic problem seemed
to be with arrays; CFQ was mostly on par on either single disk or JBOD
configurations. Following discussions, it was decided that the most
likely reason for the discrepancy was the CFQ idling function (which
tries to figure out wait times to see if more mergeable requests
arrive) and CFQ's deliberate design to keep a low queue depth (which
is fairly ideal for desktops, but which penalises arrays where
queuing is used to keep the cache flooded). After further discussion
it was decided that we could use a binary switch within CFQ to flip
between the two cases. The optimal way seems to be to assume desktop
(low queue depth, idling) but switch to array either via a whitelist
or possibly simply by SATA vs non-SATA disk detection.
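The proposed switch amounts to a very small decision rule, sketched here. Everything in it is illustrative: the whitelist entry is a placeholder, `cfq_mode` is a hypothetical name, and combining both detection methods in one function is a choice made for this sketch rather than something decided in the session.

```python
# Placeholder model string; a real whitelist would list known arrays.
ARRAY_WHITELIST = {"EXAMPLE-ARRAY"}


def cfq_mode(model: str, is_sata: bool) -> str:
    """Pick 'array' or 'desktop' behaviour, defaulting to desktop."""
    if model in ARRAY_WHITELIST or not is_sata:
        return "array"    # deeper queuing, no idling
    return "desktop"      # low queue depth, idling
```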
Libata Error Reporting - Gwendal Grignou
Gwendal outlined updates to libata to support error handling, the most
useful being an error buffer in sysfs that would report on the last
several ATA errors encountered.
Another issue is the ability to control PHYs (the endpoints in a SATA
setup), both on the host and on the port multiplier. James Bottomley
noted that PHY control is already part of the SAS transport class, so
the only real need was for some libata transport class to complement
the functionality. Gwendal said he already had a proposal for this,
but that, unfortunately, libata discovery didn't really work in the
transport class infrastructure.
The essential problem is that whereas
SCSI does nothing to configure devices until transport_setup_device()
time, ATA has already probed and set up the PHYs before SCSI even
knows about them (so before transport_add_device() which is the first
indication of device presence transport classes get). There was
general discussion on the point, but no satisfactory resolution since
libata isn't part of the SCSI discovery domain. The best way forward
is obviously to remove libata from SCSI so it can do its own
transport class setups at the correct times. However, an interim fix
might be to use an intermediate device representing the libata parent
and attach the transport class to that (somewhat like the way SAS and
FC set up intermediate devices in their SCSI tree).
Target Mode/Config FS - Nic Bellinger
Nic presented the current state of play of the LIO SCSI in-kernel
target mode driver. James Bottomley noted that there were three
conditions any target driver needed to satisfy before acceptance:
- That it would be a drop in replacement for STGT (our current
in-kernel target mode driver), since he only wanted a single SCSI
target infrastructure.
- That it used a modern sysfs based control and configuration plane.
- That the code was reviewed as clean enough for inclusion.
Fujita Tomonori reported that STGT was now happy that the current LIO
implementation provided a complete replacement that allowed STGT
userspace modules to continue functioning, so the STGT community was
happy to support its inclusion as replacing STGT.
Nic reported on the current configfs (a sysfs extension for
transporting configuration data) interface to LIO which allows
basically the full control of setup and tear down of targets from user
space.
The first two points being satisfied, Christoph Hellwig had already
begun a top to bottom review of the code (resulting in the elimination
of 2000 lines so far, with more expected on the way). There was
general agreement that pending resolution of the third point by a
successful code review from Christoph, LIO might be ready for
inclusion in main line by the 2.6.37 merge window.
There was a final question about how this should happen and James
stated his preference for seeing the minimal set of patches across the
SCSI mailing list to allow for more eyes to review.
FC Sysfs and FCoE - James Smart; Robert Love
Robert kicked off by presenting the current state of play in Fibre
Channel over Ethernet which, all in all, was pretty positive. James
Smart then took over with a detailed explanation of how the latest
Fibre Channel layer 4 mapping (FC-4) would unify FC and FCoE. The
essential problem this poses for SCSI is how to represent the new FC-4
mappings in sysfs. The difficulty is that there's a long path
(possibly traversing both standard fibre and Ethernet) from the HBA to
the rport. Additionally, the protocol introduces a set of non-SCSI
end points as well (which should also have sysfs representation).
Although it was fairly easy to attach the target and the device, the
SCSI host was a bit more of a challenge. After trying various places
in the proposed tree, no good single location could be found and it
was agreed to go with the current proposal and take the patches to the
list to see if anyone else can come up with something better.
Multipath - Alasdair Kergon
Alasdair kicked off a discussion on routing within the block layer,
namely the problem that, unlike in the networking subsystem, we have to
name the end point device in the bios we send down. This causes several
problems for the
device mapper, since it must rewrite the bio destination several times
as processing progresses in the device mapper stack. It also means
that, once the user has mounted or in any way opened a block device
(except a device mapper one), it's impossible to reroute the I/O,
meaning that if you want to install additional functionality, you have
to unmount (or in extreme cases for root filesystems, reboot) the
system.
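The repeated rewriting of the bio destination can be pictured with a toy model. It assumes simple linear remapping only, with a hypothetical table that maps each stacked device to an underlying device and a sector offset; real device mapper targets are considerably more varied, and the table is assumed to contain no cycles.

```python
from typing import Dict, Tuple

# Hypothetical table: stacked device -> (underlying device, sector offset).
Table = Dict[str, Tuple[str, int]]


def resolve(device: str, sector: int, table: Table):
    """Follow remaps down the stack; return (device, sector, rewrites)."""
    rewrites = 0
    while device in table:
        lower, offset = table[device]
        # Each layer rewrites the destination named in the bio.
        device, sector = lower, sector + offset
        rewrites += 1
    return device, sector, rewrites


stack = {"dm-1": ("dm-0", 2048), "dm-0": ("sda", 4096)}
target = resolve("dm-1", 10, stack)
```

In this model I/O opened directly on `sda` never consults the table, which is the rerouting limitation described above: there is no layer left at which a new mapping could be inserted.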
The first approach discussed was to add generic routing to the block layer.
After an exploration of implementation details, it was realised that
the only way to make this work is to detach the gendisk structure at
the top of the stack from the request queue at the bottom (because the
one to one mapping would be broken and we'd be routing over a graph
from the gendisk to the selected queue). Unfortunately, this would
involve us in a massive block driver rewrite because, although there
are subsystems like SCSI that happily operate with only a request
queue and don't care about the gendisk, the idea that the two are
closely tied together is firmly enshrined in block driver coding.
The next approach turned this concept on its head and asked: what if
every device were a device mapper? There seemed to be only two issues
in the way of this: firstly that user space would no longer be able to
care about device major numbers (they'd become pure fiction instead of
mostly fiction) and secondly, device mapper would have to
automatically create a pass through routing for any arbitrary device
that the kernel opens. The first should largely be solved by udev and
the second looks, on the face of it, to be fairly simple, so it was
agreed to take this idea to the list for further exploration (Alasdair
to do the writeup).