
Partial drive depopulation

By Jake Edge
April 27, 2016

LSFMM 2016

With today's large storage devices there are times when a component of the drive will fail (e.g. a head in a disk or a die in an SSD), which reduces the capacity of the device without rendering it completely unusable. But the arrangement of logical block addresses (LBAs) on the devices is such that the non-functioning LBAs are scattered across the device's address space. There is a need to "depopulate" (or "depop") those LBAs so that the rest of the device can continue to be used. Hannes Reinecke and Damien Le Moal led a combined storage and filesystem session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit to discuss depop and how it should be handled by the kernel.

[Hannes Reinecke & Damien Le Moal]

Le Moal began by outlining the problem, noting that there are several types of components (head, surface, die, channel) that can go bad in a device without taking the entire device with them. The device will report the problem with a "unit attention" condition. One way to handle that is with offline logical depop, where the drive is simply reformatted to the new, smaller capacity. Reinecke said that would "not require a lot of work" to handle.

The question of recovering data from the good portion of the device prior to reformatting came up. Ted Ts'o asked if there would be a list of bad sectors delivered to the kernel. Le Moal said there was a way for the host to get that list, but James Bottomley thought that sounded like an "awful lot of data to store in the kernel". For offline depop, though, the data would not need to be stored, Le Moal said.

It is a large list, Fred Knight said, as the bad sectors are likely to be spread across the LBA range. Christoph Hellwig called the list "useless" to the kernel, but Knight said that if it was just needed for recovering the good data, the block list need not be stored. The problem is that disks are not uniform in the number of sectors per track across the drive and bad-block remapping can also complicate things.
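To get a feel for why such a list is unwieldy, here is a minimal sketch in Python. It assumes a deliberately simplified, hypothetical layout in which fixed-size tracks rotate round-robin across the heads; as noted above, real drives have varying sectors per track and bad-block remapping, so the helper `failed_head_lbas()` and its parameters are illustrative only and do not model any actual device.

```python
def coalesce_lbas(bad_lbas):
    """Merge a collection of bad LBAs into sorted (start, length) extents."""
    extents = []
    for lba in sorted(set(bad_lbas)):
        if extents and lba == extents[-1][0] + extents[-1][1]:
            # This LBA directly follows the last extent, so extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((lba, 1))
    return extents


def failed_head_lbas(total_lbas, heads, track_len, bad_head):
    """Hypothetical layout: tracks of track_len LBAs rotate across the heads."""
    return [lba for lba in range(total_lbas)
            if (lba // track_len) % heads == bad_head]


bad = failed_head_lbas(total_lbas=4000, heads=4, track_len=100, bad_head=2)
extents = coalesce_lbas(bad)
print(len(bad), len(extents), extents[:3])
# prints: 1000 10 [(200, 100), (600, 100), (1000, 100)]
```

Even in this toy model, losing one head of four leaves a hole every fourth track; the number of extents grows with the number of tracks, which is what makes the list large on a real drive.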

The discussion then turned to online logical depop, where the idea is to try to avoid reformatting the drive. The healthy LBAs would be kept intact, which would leave holes in the LBA space. The holes could be "amputated", removing them from the LBA range and never using them again. Or the blocks could be "regenerated" by allocating other blocks and remapping them into the holes.
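The difference between the two options can be sketched as follows. This is an illustration under stated assumptions rather than anything a drive or the kernel implements: the functions `amputate()` and `regenerate()`, the flat integer LBA space, and the spare-block numbers are all hypothetical.

```python
def amputate(capacity, bad):
    """Return the LBAs that stay addressable once the holes are dropped.

    The usable capacity shrinks by the number of bad LBAs.
    """
    bad = set(bad)
    return [lba for lba in range(capacity) if lba not in bad]


def regenerate(capacity, bad, spares):
    """Return an LBA -> physical block map with every hole backed by a spare.

    Healthy LBAs map to themselves; each bad LBA is pointed at a spare
    block, so the LBA range stays contiguous and the capacity is unchanged.
    """
    bad = sorted(set(bad))
    if len(spares) < len(bad):
        raise ValueError("not enough spare blocks to fill every hole")
    remap = {lba: lba for lba in range(capacity)}
    remap.update(zip(bad, spares))
    return remap


usable = amputate(8, [2, 5])               # six LBAs left, with gaps at 2 and 5
mapping = regenerate(8, [2, 5], [100, 101])  # LBA 2 -> 100, LBA 5 -> 101
```

In this sketch, amputation leaves the host with a discontiguous set of usable LBAs, while regeneration keeps the range intact at the cost of spare blocks; in either case the sketch says nothing about the data that used to live in the holes.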

All of that seemed "overly complicated" to Ric Wheeler. He suggested that users would simply regenerate the filesystem from backups rather than fix the holes. They would truncate the size of the device and reformat it to get it back into production. The data still on the platters would just be ignored.

Chris Mason agreed that users are likely to take the drive out of production, truncate and reformat it, then put it back. "Healing" drives is not an online process, he said. Wheeler said that he thought any work on online depop was likely a waste of time.

But Knight said that a failure that only affected 10% of the drive would only take 10% of the time to rebuild, which might be attractive in some cases. Mason, though, felt that most would want some kind of verification step before bringing a partially failing drive back online. It may be true that it is simply one component that has failed, but that isn't truly known until the drive is examined and tested. Failing to do that could result in a "bunch of borderline stuff" running in production, he said.

Bottomley and Martin Petersen both said that a large discontiguous LBA range was not really usable. Wheeler summed up the feeling in the room by saying that offline depop is something that can be supported, but that unless the LBA regions were large or computable, they were not something that the kernel developers would use; "scatter-gather lists of LBAs" are not helpful.


Index entries for this article
Kernel: Block layer
Conference: Storage, Filesystem, and Memory-Management Summit/2016



Partial drive depopulation

Posted Apr 28, 2016 11:37 UTC (Thu) by skitching (guest, #36856)

I work with user-space distributed filesystems, such as Apache Hadoop DFS (HDFS), which are used in large clusters. When a file is "uploaded" from a normal filesystem into HDFS, it is split into equal chunks (typically 512MB) and each chunk is saved as a normal file on a normal filesystem (e.g. ext4) on multiple nodes in a cluster. Other systems work similarly.

A background thread on each server periodically verifies each "chunk" (native file) against its checksum; if this fails then the native file is marked as bad and the system automatically makes (somewhere in the cluster) an additional copy of the chunk from one of the surviving copies.
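A scrub pass of this kind might look roughly like the sketch below. It is not HDFS code: the function names, the `expected` dictionary of chunk name to digest, and the use of SHA-256 are assumptions made for illustration, and the re-replication step is left out entirely.

```python
import hashlib
from pathlib import Path


def file_digest(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def scrub(chunk_dir, expected):
    """Return the names of chunks that are missing, unreadable, or corrupt.

    `expected` maps a chunk file name to the digest recorded at write time.
    """
    bad = []
    for name, digest in sorted(expected.items()):
        try:
            ok = file_digest(Path(chunk_dir) / name) == digest
        except OSError:
            # A read error from a failing drive counts as a bad chunk too.
            ok = False
        if not ok:
            bad.append(name)
    return bad
```

Each name returned by `scrub()` would then be handed to whatever mechanism copies the chunk from a surviving replica elsewhere in the cluster.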

Losing an entire drive is therefore survivable, and the system automatically recovers. However, if a significant portion of the files on a partially failed drive could be preserved, it would save a lot of I/O.

In a cluster of a few thousand commodity servers, each with, say, 8 x 1TB disks, disk failures are common. Every case where the system can recover automatically, rather than needing manual intervention, is helpful.

Partial drive depopulation

Posted Apr 29, 2016 0:20 UTC (Fri) by gutschke (subscriber, #27910)

I am worried that in practice you lose a lot more than just a couple of files. There is an extremely high likelihood that you'll lose filesystem metadata (inodes, extents, directories) that isn't currently in memory.

Even if parts of files or even full files are still theoretically on the intact part of the media, the kernel might have no idea where to even find them.

As for the files that do have intact metadata, they probably straddle multiple platters (in the case of spinning media) or multiple dies (for SSDs), as that maximizes parallelism. So, if one of the heads or one of the dies fails, you'll end up with a hole-y file.

In other words, yes, some data might be retrievable, especially if the files are small. But most of it is either entirely inaccessible or fragmented. In the end, you gain so little from recovery, and the management overhead is so high, that you are better off retrieving the data from some other redundant source.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds