The day began with a review of request-based multipathing, which tries to solve the problem that the device mapper implements multipathing well above the block layer and thus has great difficulty getting accurate data from, and control of, the underlying devices, which sit well below the block subsystem. The agreed solution is to implement a request-based multipathing layer that sits underneath the block layer and provides a bidirectional conduit to the device mapper above it, so that multipath status and event processing can occur much closer to where the events are actually generated.
Hannes Reinecke presented the current state of multipath which, from a SCSI and block point of view, is generally good: all of the necessary SCSI and block components for request-based multipathing are now upstream. We also have device-specific multipath plugins which handle the vagaries of the multipath implementation in certain well-known storage arrays and ensure that path failure and path switchover occur much more expediently. The remaining code changes are for the device mapper layer; they are sitting in Alasdair Kergon's tree but aren't yet upstream.
Two problems remained to be discussed. The first was fast fail for the path checker daemon. Device mapper uses a daemon to ping the path periodically to make sure it is alive. Unfortunately, if something bad happens and error handling takes over, it can be many seconds before the ping returns. Several ways of handling this were discussed, and it was agreed to implement some type of uevent that is sent when error handling is invoked on the device, thus informing the path checker that something is wrong.
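The fast-fail idea can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names (`PathChecker`, `on_error_handling_event`) are hypothetical and do not correspond to the real path checker daemon or to any existing uevent.

```python
class PathChecker:
    """Toy model of a path checker that accepts early failure notifications."""

    def __init__(self, paths):
        # Every path starts out assumed alive.
        self.state = {path: "active" for path in paths}

    def on_ping_result(self, path, ok):
        # Slow route: the periodic ping eventually returns (possibly after
        # many seconds if error handling is in progress).
        self.state[path] = "active" if ok else "failed"

    def on_error_handling_event(self, path):
        # Fast route (hypothetical uevent): error handling has started on
        # the device, so stop using the path without waiting for the ping.
        self.state[path] = "failed"

    def usable_paths(self):
        return sorted(p for p, s in self.state.items() if s == "active")


checker = PathChecker(["sda", "sdb"])
checker.on_error_handling_event("sda")   # notification arrives before the ping
print(checker.usable_paths())
```

The point of the design is that the notification path and the ping path update the same state, so the ping remains the authority for bringing a path back.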
Finally, the problems with last path disconnection were debated. There are two competing tensions here: the first is that the device mapper queue_if_no_path setting, which most people use, has the problem that if all paths are failed it will hang on to outstanding I/O forever until a path returns. Worse, if the device is backing a filesystem, the normal I/O writeback path will be filling the system up with dirty pages that cannot be cleaned because writeout is stuck. Finally, the system can get to a point where even if a path returns, we don't have enough memory left to do all the operations necessary to reattach the path, which means the system is live-locked on dirty memory and never recovers. After a detour into mempool implementations, it was agreed that, given device mapper's reliance on a user-space daemon for path attachment, it was impossible to protect all allocations such that the user space daemon can do its job. Thus, the final agreed solution was to add a separately tunable timeout to the queue_if_no_path case to set a wait time beyond which device mapper will error out all I/O in the queue (thus freeing the dirty memory) and allow the system to recover as best it can.
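The agreed behaviour (queue while no path exists, then error everything out once a tunable timeout expires) might be modelled roughly as follows. This is a user-space illustration, not device mapper code; the `NoPathQueue` class and its `no_path_timeout` parameter are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class NoPathQueue:
    """Toy model of queue_if_no_path with a hypothetical timeout."""

    def __init__(self, no_path_timeout):
        self.no_path_timeout = no_path_timeout  # seconds; hypothetical tunable
        self.queued = deque()
        self.paths_up = True
        self.no_path_since = None

    def all_paths_failed(self, now):
        self.paths_up = False
        self.no_path_since = now

    def path_restored(self, now):
        # Hand back whatever is still queued so it can be reissued.
        self.paths_up = True
        self.no_path_since = None
        drained = list(self.queued)
        self.queued.clear()
        return drained

    def expired(self, now):
        return (not self.paths_up
                and now - self.no_path_since >= self.no_path_timeout)

    def submit(self, io, now):
        if self.paths_up:
            return "dispatched"
        if self.expired(now):
            return "error"          # past the timeout: fail immediately
        self.queued.append(io)
        return "queued"

    def tick(self, now):
        # Once the timeout has elapsed, error out all queued I/O so the
        # dirty memory behind it can be released.
        if not self.expired(now):
            return []
        failed = list(self.queued)
        self.queued.clear()
        return failed
```

With a timeout of 30 seconds and all paths failing at t=10, I/O submitted at t=11 is queued, and a `tick` at t=40 returns it as failed rather than holding it forever.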
Unfortunately, I had a conflicting meeting, so there's no summary of this one.
Tejun Heo led off with a request for help on a particularly knotty problem: enabling zero-length barriers for ATA devices has led some of them to power off briefly when a barrier flush is issued. This shows up only as a brief glitch on the SATA bus, but it causes the entire drive cache to be trashed, leading to the loss of data. The greatest problem here is detecting when drives actually do this, because the event is indistinguishable from a bus fluctuation. Several root causes were discussed, including the possibility that the problem is created by power saving within multiple drives: they all power up briefly to flush the cache, overwhelm the power supply and thus lose the data. In the end it was agreed that the best way of detecting this situation is from user space, using the SMART power cycle counter. (The reason for using a user-space daemon is that there's no standard way of getting at the drive SMART data for ATA drives, although SCSI does have a standard mode page.)
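The detection logic such a daemon needs is simple: if the drive's power cycle count goes up while the system itself has stayed up, the drive has power cycled underneath us. The sketch below assumes the caller supplies a `read_power_cycles` function that returns the current counter value; how that value is actually obtained from the drive is deliberately left out, and the `PowerCycleWatcher` name is hypothetical.

```python
class PowerCycleWatcher:
    """Flags an unexpected drive power cycle by watching a counter."""

    def __init__(self, read_power_cycles):
        # read_power_cycles: caller-supplied callable returning an int
        # (hypothetical; stands in for a SMART attribute query).
        self._read = read_power_cycles
        self._last = read_power_cycles()

    def poll(self):
        # True if the counter increased since the previous reading.
        current = self._read()
        cycled = current > self._last
        self._last = current
        return cycled


# Usage with canned readings in place of a real drive.
readings = iter([7, 7, 8, 8])
watcher = PowerCycleWatcher(lambda: next(readings))
for _ in range(3):
    print(watcher.poll())
```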
Discussion then moved on to the current status of getting libata out of SCSI: we have had several successes, notably timer handling and pieces of error handling have moved up to block. Unfortunately, the current progress has reached the point where it's being impeded by the legacy IDE subsystem which is still relying on some very old fields and undocumented behaviour of the block layer, since the next step is to simplify the block to low level interface and move to a more exact and well understood API. No solutions were proposed, but Tejun will continue on trying to clean up both block and drivers/ide in parallel to achieve this.
This was a joint session presented by Fernando Cao, Vivek Goyal and Nauman Rafique. Following feedback from the summit last year that an approach which worked only for the CFQ scheduler was unacceptable, the group had moved on to consider a layered approach that would provide a throughput-aware scheduling service that worked for all services (it was pointed out that the name for this: "I/O Controller" unfortunately caused most practitioners immediately to think of I/O Cards, and thus might need changing).
The problem with the I/O Controller approach, which NTT voiced through Fernando, is that, in an advanced storage topology, it is very hard to ensure consistent throughput to virtual machines which sit way above the I/O controller (which only sees the physical devices, not even the low layer LVM or virtual devices). After several rounds of discussion, it was agreed that this was a problem, but that the correct place for the solution was still at the I/O scheduler level. However, we would also run a lightweight daemon closer to the virtual machines, very similar to irqbalanced, whose job would be to keep an eye on the virtual machine I/O throughput and adjust the low level I/O schedulers periodically to achieve fairness to the virtual machines (and ensure they achieve their targeted I/O bandwidth).
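One way to picture the daemon's periodic adjustment is as a simple feedback step: compare each virtual machine's measured throughput with its target and scale a per-VM scheduler weight accordingly. The session did not specify which scheduler knob would be adjusted or how, so the weights, the limits and the `adjust_weights` function here are all assumptions made for illustration.

```python
def adjust_weights(weights, measured, targets, min_w=1, max_w=1000):
    """Return new per-VM weights, scaled by target/measured throughput.

    weights, measured and targets are dicts keyed by VM name. The weight
    range (min_w..max_w) is a hypothetical limit, not a real tunable.
    """
    new_weights = {}
    for vm, weight in weights.items():
        seen = measured.get(vm, 0.0)
        if seen <= 0:
            # No I/O observed (VM may simply be idle): leave it alone.
            new_weights[vm] = weight
            continue
        scaled = round(weight * targets[vm] / seen)
        new_weights[vm] = max(min_w, min(max_w, scaled))
    return new_weights


weights = {"vm1": 100, "vm2": 100}
measured = {"vm1": 50.0, "vm2": 200.0}    # MB/s seen in the last interval
targets = {"vm1": 100.0, "vm2": 100.0}    # MB/s each VM should receive
print(adjust_weights(weights, measured, targets))
```

A real daemon would presumably damp these changes to avoid oscillation; the sketch applies the full correction in one step to keep the idea visible.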
The last problem discussed was that of correctly accounting for I/O operations as they passed through the I/O scheduler. I/O is accounted through an I/O context which is attached to the bio as it passes through the scheduler (meaning it is accounted to whatever process happens to be current at the time). Unfortunately, this means that a lot of I/O is incorrectly accounted to either the pdflush thread, or to random processes that happen to be current when the I/O ended up being submitted. In order to do the accounting correctly, some type of per-page pointer to the io_context needs to be added. The feedback from last year leaned very strongly toward the thought that expanding struct page, which is critical-path infrastructure, would not be appreciated, so Fernando's proposal was to have a second array, indexed by page frame number, in which the I/O context could be stored. This method would be tunable by a config option, but would occupy about 1KB per megabyte of memory at boot time. After some discussion of the memory cost, it was agreed that this probably represented about the only way forward, but that the secondary struct page mechanism should be made available for other processes that might need to attach pointers to struct page but which also didn't need to be in the critical path.
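The quoted memory cost can be checked with a little arithmetic. Assuming 4KB pages and one 4-byte pointer per page (both assumptions; the session summary gives only the 1KB-per-megabyte figure), a megabyte holds 256 pages and so needs 1KB of pointers; with 8-byte pointers the cost would double. The `PageIoContextTable` class is a hypothetical stand-in for the proposed secondary array indexed by page frame number.

```python
def io_context_array_bytes(mem_bytes, page_size=4096, pointer_size=4):
    """Bytes needed for one pointer per page frame (assumed sizes)."""
    pages = mem_bytes // page_size
    return pages * pointer_size


class PageIoContextTable:
    """Toy secondary array: page frame number -> owning I/O context."""

    def __init__(self, num_pages):
        self.slots = [None] * num_pages

    def set_owner(self, pfn, io_context):
        # Record who dirtied the page, outside of struct page itself.
        self.slots[pfn] = io_context

    def owner(self, pfn):
        return self.slots[pfn]


MIB = 1024 * 1024
print(io_context_array_bytes(MIB))                  # 4-byte pointers
print(io_context_array_bytes(MIB, pointer_size=8))  # 8-byte pointers

table = PageIoContextTable(num_pages=256)
table.set_owner(42, "io_context of the writing task")
print(table.owner(42))
```

Looking the owner up by page frame number at writeback time is what would let the I/O be charged to the task that dirtied the page rather than to whichever thread submits it.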
Finally, the entire group took an action to send their current patch queue to both the Linux kernel mailing list and the device mapper list.
James Smart reviewed the current status of the much maligned SNIA HBA API for Fibre Channel. When originally promulgated, this standard mandated a large and ugly ioctl() layer sitting in all drivers and was roundly rejected by the kernel developers. Since that time, most of the functionality required by the HBAAPI has been implemented in the Fibre Channel transport class as sysfs attributes. The only remaining implementation piece is the ability to transport arbitrary FCP (Fibre Channel Protocol) frames onto the SAN. Version five of the patch to do this (similar to the way SAS does expander frame communication, based on the Linux standard SG_IO ioctl) is currently circulating on the SCSI mailing list and is expected to be merged in 2.6.31. The mechanism is slightly more complex than the SAS API since the frame command is defined by the implementations, not by FCP, with the response data communicated in the sense buffer and the response frame in the bidirectional in data.
The last remaining problem is that of receiving asynchronous events from the fibre. After a bit of discussion, it was agreed that using SG_IO to queue receive frames and some type of netlink message or uevent to signal their arrival was probably an optimal solution for this. James agreed to return to the mailing list with patches implementing this for review. With this last piece, equivalent functionality to the HBAAPI (with even one to one mapping in user space) should now be available on Linux.
Christoph Hellwig reported that the majority of the discussion had been covered in the original plenary session and that not much remained to be covered in the storage specific session.
Dan Williams took us through an implementation he was working on that provided for sysfs reporting and configuration of md raid volumes. James Bottomley noted that this was very similar to the way the original RAID transport class had been implemented and Dan agreed to look at this to determine if he could provide a generic mechanism that would provide initially for a unified view of RAID volumes that would be identical between the various hardware and software RAID solutions and which finally might provide us with a unified configuration interface for each of them.
Copyright © 2009, Eklektix, Inc.