Notes from the LSF summit storage track

LWN readers will have seen our reporting from the Linux Storage and Filesystem Summit (day 1, day 2), held on August 8 and 9 in Boston. Your editor was unable to attend the storage-specific sessions, though, so they were not covered in those articles. Fortunately, James Bottomley took detailed notes, which he has now made available to us. Many thanks to James for all of what follows.

BSG and iSCSI - Mike Christie; FUJITA Tomonori

The main iSCSI discussion centred around offloads, TCP offload engines (TOE), and flow steering. The view expressed was that a TOE is only acceptable if the card processes only iSCSI and doesn't need to integrate with the network stack; otherwise, it needs to do segmentation offload and flow steering, although it may still process full iSCSI further up the stack.

For BSG, the main discussion centred around vendor-specific commands, which can now be transmitted through the interface. James Bottomley would be very unhappy if common functions ended up being implemented as a mosaic of vendor-specific commands. The question was asked whether a common API should be implemented (via a tool) in user space, which would then translate to the myriad vendor-specific commands. After some discussion, it was agreed that this wasn't a good way forward, and that commonality would be better achieved by standardising the actual BSG commands directly via the Linux API.

SAN Management - Jacob Cherian; Joel Becker

Jacob went over the basics that vendors are looking for, most of which centre around virtual machine provisioning using pieces of the HBAAPI. Joel then presented a tool which Oracle has been working on to do exactly this. Written in Python, it should allow the necessary pluggable modules for individual arrays to make this functionality universally available. Initial feedback was that Python is less than desirable for this, since ultimately there are many tools that need to use this functionality, so something lower level like a C library would be better. Unfortunately, Joel announced that, while this work would eventually be available as open source (no time frame could currently be given), Oracle was working under NDA with some vendors and was unable to release source code at this time. Considerable annoyance was expressed by all others present about this, particularly around the fact that this is exactly the wrong way to work with the community, since Oracle will be presenting a fait accompli with no chance for input. Representatives of the distributions were especially incensed since they already have people working on aspects of this, and will likely be duplicating effort.

Error Handling - Hannes Reinecke

The first topic up for discussion was Unit Attention (UA), which is the SCSI way of signalling the fact that some event occurred asynchronously. Particular UAs of interest were target reconfiguration ones, but there was general agreement that the SCSI stack should have a UA handling infrastructure that took the information all the way up to user space.
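
For those not steeped in SCSI, a unit attention arrives as sense data with sense key 0x06; the "target reconfiguration" conditions mentioned above show up as particular ASC/ASCQ pairs. The minimal decoder below is only an illustration of what a user-space consumer of such an infrastructure might look at; the two ASC/ASCQ pairs shown are examples, not an exhaustive list.

    /*
     * Illustrative decoder for fixed-format SCSI sense data; a unit
     * attention is sense key 0x06, with the specific event identified
     * by the ASC/ASCQ pair.
     */
    #include <stdint.h>
    #include <stdio.h>

    static void decode_ua(const uint8_t *sense)
    {
        uint8_t key  = sense[2] & 0x0f;     /* fixed format: key in byte 2 */
        uint8_t asc  = sense[12];
        uint8_t ascq = sense[13];

        if (key != 0x06) {
            printf("not a unit attention (sense key 0x%02x)\n", key);
            return;
        }
        if (asc == 0x3f && ascq == 0x0e)
            printf("UA: reported LUNs data has changed\n");
        else if (asc == 0x2a && ascq == 0x09)
            printf("UA: capacity data has changed\n");
        else
            printf("UA: asc/ascq 0x%02x/0x%02x\n", asc, ascq);
    }

    int main(void)
    {
        /* Example sense buffer: a "reported LUNs data has changed" UA. */
        uint8_t sense[18] = { 0x70, 0, 0x06, [12] = 0x3f, [13] = 0x0e };
        decode_ua(sense);
        return 0;
    }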

Next up was error processing, which SCSI still does a poor job of, especially with regard to translating the error codes for upward consumption (in the device mapper and filesystems). The first proposal, which is easy to implement, is to copy the SCSI result directly to the request result unconditionally, and to copy up the sense information if the request has a sense buffer. In the long run, though, the device mapper doesn't really want to be responsible for doing this translation. The old idea of having SCSI send up an additional classification along with the code was revived; the classification would run along two axes: fatal vs. retryable, and whether the error occurred in the device, the transport, or the host. This would allow the device mapper software to make more intelligent decisions about path switching (for device errors, there's usually no point in switching, because the device will give the same error on any path). There was some discussion of whether this translation could be achieved using the device mapper SCSI handlers; after discussion on the point, it was agreed that the answer wouldn't be known until the code was produced.
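
A rough sketch of what the proposed two-axis classification might look like follows, purely to illustrate the path-switching logic described above; the enum names and the helper function are hypothetical, not an existing kernel interface.

    /*
     * Hypothetical two-axis error classification: severity and origin.
     * A multipath layer could then decide whether switching paths is
     * worthwhile.
     */
    #include <stdbool.h>
    #include <stdio.h>

    enum blk_err_severity { BLK_ERR_RETRYABLE, BLK_ERR_FATAL };
    enum blk_err_origin   { BLK_ERR_DEVICE, BLK_ERR_TRANSPORT, BLK_ERR_HOST };

    static bool should_switch_path(enum blk_err_severity sev,
                                   enum blk_err_origin origin)
    {
        /*
         * Another path still leads to the same device, so device errors
         * will usually just repeat; transport or host errors are worth
         * retrying on a different path.
         */
        (void)sev;
        return origin != BLK_ERR_DEVICE;
    }

    int main(void)
    {
        printf("transport error -> switch paths: %d\n",
               should_switch_path(BLK_ERR_RETRYABLE, BLK_ERR_TRANSPORT));
        printf("device error    -> switch paths: %d\n",
               should_switch_path(BLK_ERR_FATAL, BLK_ERR_DEVICE));
        return 0;
    }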

FC Sysfs and FCoE - James Smart; Robert Love

This session overran badly and was awarded another slot on Day 2; it is covered in detail below.

Thin Provisioning - Mark Ruder; Jacob Cherian

Mark and Jacob presented the position from the point of view of their arrays (EqualLogic), where thin provisioning is used to claim that the array has more storage space than it actually has, allowing users to buy more as they approach the limits. The arrays need to be told when space that was written to is no longer in use, but this is nicely handled by the current discard code, which is translated to UNMAP or WRITE SAME in the SCSI layer. One of the problems arrays have is that their mapping blocks are quite large (usually 500KB-1MB), so any trim the kernel sends down needs to be aligned to them (anything smaller will simply be ignored). This makes the idea of accumulating discards important to these arrays. There was a general discussion about whether online discard (versus an offline pass run periodically from cron) was more useful to these arrays. The general consensus was that online discard was important, but that running offline every few days to pick up segments that might have been missed due to alignment issues would also be useful.
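
To see why alignment matters, consider a hypothetical array with 1MB mapping blocks: the range the kernel discards has to be clipped inward to mapping-block boundaries before anything can actually be unmapped, so small or misaligned discards achieve nothing. The mapping-block size and the numbers below are made up for the example.

    /* Illustration (not kernel code) of clipping a discard range to
     * an assumed 1MB array mapping block. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAP_BLOCK (1024 * 1024ULL)   /* assumed mapping block: 1MB */

    int main(void)
    {
        uint64_t start = 3 * MAP_BLOCK + 4096;   /* discard start (bytes) */
        uint64_t end   = 7 * MAP_BLOCK + 512;    /* discard end (bytes)   */

        /* Round the start up and the end down to mapping-block boundaries. */
        uint64_t first = (start + MAP_BLOCK - 1) / MAP_BLOCK * MAP_BLOCK;
        uint64_t last  = end / MAP_BLOCK * MAP_BLOCK;

        if (first >= last)
            printf("discard too small or misaligned: nothing unmapped\n");
        else
            printf("array can unmap %llu mapping blocks (bytes %llu-%llu)\n",
                   (unsigned long long)((last - first) / MAP_BLOCK),
                   (unsigned long long)first, (unsigned long long)last);
        return 0;
    }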

The next issue was about space availability warnings (the arrays produce unit attention conditions when they run short of physical space). General consensus was that we could print out the condition for the sysadmin to rectify, but the filesystems themselves couldn't really do anything with the knowledge.

Trim and Discard - Martin Petersen

Martin presented the current situation around trim and discard. In SCSI, the problem is that array manufacturers have settled on two representations: UNMAP, which can take multiple ranges, and WRITE SAME, which only takes a single range. Since they're both optional, the array is allowed to pick which one it supports. Fortunately, T10 finally came through with a way for us to tell which one the array wants, so the fallback problem is now largely solved.

Unfortunately, the current problem with ATA devices is that, although the TRIM command takes ranges, each range is limited to a maximum size of 64K blocks (or 32MB at a time on 512-byte-sector devices). That makes large trims, such as those resulting from deleting a gigabyte-sized file or from mkfs (which trims the entire partition), considerably more difficult. Attention also turned to the behaviour of trim on SATA devices supporting NCQ: since TRIM isn't a queued command, the device has to complete all queued commands before the trim can be sent on its own, then allow the tags to build back up. This causes a pause in disk writes which can damage performance. Matthew Wilcox said the non-queued problem had been pointed out to T13 (the standards body governing ATA) and a queued version might be in the works. No solution is currently proposed for the limited-range problem.
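
A quick worked example of the range limit, following the figures above (512-byte sectors, 64K blocks per range, so roughly 32MB per range); the exact per-range cap is device-defined, and the 100GB figure is simply an illustration of a large file deletion.

    /* Arithmetic sketch: how many TRIM ranges a large deletion needs. */
    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_SIZE     512ULL
    #define MAX_RANGE_SECS  (64 * 1024ULL)   /* 64K blocks per range */

    int main(void)
    {
        uint64_t region_bytes = 100ULL * 1024 * 1024 * 1024;  /* 100GB freed */
        uint64_t sectors      = region_bytes / SECTOR_SIZE;
        uint64_t ranges       = (sectors + MAX_RANGE_SECS - 1) / MAX_RANGE_SECS;

        printf("per-range limit: %llu MB\n",
               (unsigned long long)(MAX_RANGE_SECS * SECTOR_SIZE / (1024 * 1024)));
        printf("trimming %llu GB needs %llu ranges\n",
               (unsigned long long)(region_bytes >> 30),
               (unsigned long long)ranges);
        return 0;
    }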

SSD continuation - Eric Seppanen; Matthew Wilcox

After the conclusion of the plenary session on SSDs, the I/O track retired to deal with I/O-specific issues. First up was the request queue. Eric noted that the current crop of PCIe drivers were trying to do without the request queue entirely. This was largely thought to be a bad idea, since it removes the device from the priority- and cgroup-based bandwidth QoS work which is looking to be necessary, at least in virtual environments. Discussion then passed on to reducing the stack latency in the block layer by reducing the number of allocations made, particularly around sense-data handling; old ideas about a fixed number of rotating sense buffers were given an airing, as was having a rotating queue of ready requests and SCSI commands. The basic advice was to move to using a request queue; then we'd see if the latency could be reduced.

Discussion next turned to MSI-X and multiqueue interrupt steering, the idea being to keep data cache-hot by returning the completion to the same CPU that issued the request. Eric noted that it doesn't really have to be the same CPU, just the same non-L1 cache (so the same package, but not necessarily the same core). Jens noted that the block layer keeps this information today in req->cpu, but that it was up to the driver to use it to steer the request into the correct queue and the MSI-X completion back to the right CPU. It was noted that we can use the current IRQ affinity interface to bind the MSI-X interrupts on a per-CPU basis, but that there was no generic interface to set them up correctly (so perhaps there should be).
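
The "current IRQ affinity interface" referred to here is the per-interrupt smp_affinity file under /proc/irq/. A minimal sketch of binding a few vectors to consecutive CPUs follows; the IRQ numbers and CPU count are invented for the example, and a real setup would walk the device's actual MSI-X vectors.

    /* Pin an interrupt to one CPU by writing a hex mask to
     * /proc/irq/<N>/smp_affinity (requires root). */
    #include <stdio.h>
    #include <stdlib.h>

    static int bind_irq_to_cpu(int irq, int cpu)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        /* smp_affinity takes a hexadecimal bitmask of allowed CPUs. */
        fprintf(f, "%x\n", 1 << cpu);
        return fclose(f);
    }

    int main(void)
    {
        /* Hypothetical: bind MSI-X vectors 42..45 to CPUs 0..3. */
        for (int cpu = 0; cpu < 4; cpu++)
            if (bind_irq_to_cpu(42 + cpu, cpu) != 0)
                return EXIT_FAILURE;
        return 0;
    }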

Finally, developers noted that the best way to get them thinking about the problems was to give them hardware (hint). Matthew Wilcox noted that even the silicon vendors didn't really have the latest hardware and that his current NVMHCI rig consisted of two x86 boxes joined by a PCIe bus (one acting as the host and one as the device).

CFQ Performance - Jens Axboe; Vivek Goyal

Vivek began by presenting a series of slides showing performance graphs where CFQ was outperformed by either deadline or noop scheduling on a set of plausible workloads. The basic problem seemed to be with arrays; CFQ was mostly on par on single-disk or JBOD configurations. Following discussion, it was decided that the most likely reasons for the discrepancy were the CFQ idling function (which tries to figure out wait times to see if more mergeable requests arrive) and CFQ's deliberate design to keep a low queue depth (which is fairly ideal for desktops, but which penalises arrays, where queueing is used to keep the cache flooded). After further discussion it was decided that we could use a binary switch within CFQ to flip between the two cases. The optimal approach seems to be to assume a desktop (low queue depth, idling) but switch to array mode either via a whitelist or possibly simply by SATA vs. non-SATA disk detection.
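
Until such a switch exists inside CFQ, the per-device workaround is the existing sysfs knob for choosing an I/O scheduler. A minimal sketch follows; the device name is an example and writing the file requires root.

    /* Select the deadline elevator for one block device via sysfs. */
    #include <stdio.h>

    int main(void)
    {
        const char *dev = "sdb";        /* example array-backed device */
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        /* Writing a scheduler name selects it for this queue. */
        fputs("deadline\n", f);
        return fclose(f) ? 1 : 0;
    }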

Libata Error Reporting - Gwendal Grignou

Gwendal outlined updates to libata to support error handling, the most useful being an error buffer in sysfs that would report on the last several ATA errors encountered.

Another issue is the ability to control PHYs (the endpoints in a SATA setup), both on the host and on the port multiplier. James Bottomley noted that PHY control is already part of the SAS transport class, so the only real need was for some libata transport class to complement the functionality. Gwendal said he already had a proposal for this, but that, unfortunately, libata discovery didn't really work in the transport class infrastructure.

The essential problem is that, whereas SCSI does nothing to configure devices until transport_setup_device() time, ATA has already probed and set up the PHYs before SCSI even knows about them (so before transport_add_device(), which is the first indication of device presence that transport classes get). There was general discussion on the point, but no satisfactory resolution, since libata isn't part of the SCSI discovery domain. The best way forward is obviously to remove libata from SCSI so it can do its own transport-class setup at the correct times. However, an interim fix might be to use an intermediate device representing the libata parent and attach the transport class to that (somewhat like the way SAS and FC set up intermediate devices in their SCSI trees).

Target Mode/Config FS - Nic Bellinger

Nic presented the current state of play of the LIO SCSI in-kernel target mode driver. James Bottomley noted that there were three conditions any target driver needed to satisfy before acceptance:

  1. That it would be a drop-in replacement for STGT (our current in-kernel target mode driver), since he only wanted a single SCSI target infrastructure.

  2. That it used a modern sysfs based control and configuration plane.

  3. That the code was reviewed as clean enough for inclusion.

Fujita Tomonori reported that the STGT developers were now satisfied that the current LIO implementation provided a complete replacement allowing STGT user-space modules to continue functioning, so the STGT community was happy to support its inclusion as a replacement for STGT.

Nic reported on the current configfs (a sysfs-like filesystem for transporting configuration data) interface to LIO, which allows essentially full control of target setup and teardown from user space.
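
For readers unfamiliar with configfs, the model is that a mkdir() under /sys/kernel/config creates a kernel object, writes to its attribute files configure it, and rmdir() tears it down. The sketch below shows only the general pattern; the "target/example_tgt" path and "enable" attribute are invented for illustration, not LIO's actual layout.

    /* Generic configfs usage pattern: create an object, set an attribute. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* Hypothetical example object and attribute. */
        const char *obj  = "/sys/kernel/config/target/example_tgt";
        const char *attr = "/sys/kernel/config/target/example_tgt/enable";
        FILE *f;

        if (mkdir(obj, 0755) != 0) {     /* mkdir creates the kernel object */
            perror("mkdir");
            return 1;
        }
        f = fopen(attr, "w");            /* attributes appear as files */
        if (!f) {
            perror(attr);
            return 1;
        }
        fputs("1\n", f);
        return fclose(f) ? 1 : 0;
    }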

The first two points being satisfied, Christoph Hellwig had already begun a top-to-bottom review of the code (resulting in the elimination of 2000 lines so far, with more expected on the way). There was general agreement that, pending resolution of the third point by a successful code review from Christoph, LIO might be ready for inclusion in the mainline by the 2.6.37 merge window.

There was a final question about how this should happen; James stated his preference for seeing the minimal set of patches posted to the SCSI mailing list to allow more eyes to review them.

FC Sysfs and FCoE - James Smart; Robert Love

Robert kicked off by presenting the current state of play in Fibre Channel over Ethernet which, all in all, was pretty positive. James Smart then took over with a detailed explanation of how the latest Fibre Channel layer-4 mapping (FC-4) would unify FC and FCoE. The essential problem this poses for SCSI is how to represent the new FC-4 mappings in sysfs. The difficulty is that there's a long path (possibly traversing both standard fibre and Ethernet) from the HBA to the rport. Additionally, the protocol introduces a set of non-SCSI endpoints as well (which should also have sysfs representation). Although it was fairly easy to attach the target and the device, the SCSI host was a bit more of a challenge. After trying various places in the proposed tree, no good single location could be found, so it was agreed to go with the current proposal and take the patches to the list to see if anyone else can come up with something better.

Multipath - Alasdair Kergon

Alasdair kicked off a discussion on routing within the block layer: namely, the problem that, unlike in the networking subsystem, we have to name the endpoint device in the BIOs we send down. This causes several problems for the device mapper, since it must rewrite the BIO destination several times as processing progresses down the device mapper stack. It also means that, once the user has mounted or in any way opened a block device (except a device mapper one), it's impossible to reroute the I/O, meaning that to install additional functionality you have to unmount the filesystem (or, in extreme cases such as root filesystems, reboot the system).

The first approach discussed was to add generic routing to the block layer. After an exploration of implementation details, it was realised that the only way to make this work is to detach the gendisk structure at the top of the stack from the request queue at the bottom (because the one-to-one mapping would be broken and we'd be routing over a graph from the gendisk to the selected queue). Unfortunately, this would involve a massive block-driver rewrite because, although there are subsystems like SCSI that happily operate with only a request queue and don't care about the gendisk, the idea that the two are closely tied together is firmly enshrined in block-driver coding.

The next approach turned this concept on its head and asked: what if every device were a device mapper device? There seemed to be only two issues in the way of this: firstly, that user space would no longer be able to care about device major numbers (they'd become pure fiction instead of mostly fiction) and, secondly, that the device mapper would have to automatically create a pass-through routing for any arbitrary device that the kernel opens. The first should largely be solved by udev, and the second looks, on the face of it, to be fairly simple, so it was agreed to take this idea to the list for further exploration (Alasdair to do the writeup).



BSG...

Posted Aug 17, 2010 23:15 UTC (Tue) by pflugstad (subscriber, #224) [Link]

For those not up on the storage acronym list:

BSG seems to be: Block layer SCSI Generic (bsg)

Non-queued TRIM

Posted Aug 20, 2010 12:00 UTC (Fri) by i3839 (guest, #31386) [Link]

> Attention also turned to the behaviour of trim on SATA devices supporting
> NCQ: since Trim isn't a queued command, it means that, to send a trim,
> the device has to complete all queued commands then send a trim on its
> own, then allow the tags to build back up. This causes a pause in the
> disk writes which can damage performance.

TRIMs can be aggregated and sent after a flush-cache command.

In practice TRIMs should be fine after big read or write requests too,
because those provide enough parallelism without command queueing, so
draining the queue during such a command should leave room for
submitting TRIMs afterwards.

Notes from the LSF summit storage track

Posted Aug 21, 2010 10:30 UTC (Sat) by jmcnulty (guest, #60140) [Link]

If major numbers are to become pure fiction then it's time for sysstat to find a more reliable means of identifying disks across reboots. sar -d still uses a major-minor number string to record per-disk performance stats.

