| Please consider subscribing to LWN Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net. |
An old-style rotating disk drive does not really care if any specific block contains useful data or not. Every block sits in its assigned spot (in a logical sense, at least), and the knowledge that the operating system does not care about the contents of any particular block is not something the drive can act upon in any way. More recent storage devices are different, though; they can - in theory, at least - optimize their behavior if they know which blocks are actually worth hanging onto. Linux has a couple of mechanisms for communicating that knowledge to the block layer - one added for 2.6.37 - but it's still not clear which of those is best.
So when might a block device want to know about blocks that the host system no longer cares about? The answer is: just about any time that there is a significant mapping layer between the host's view of the device and the true underlying medium. One example is solid-state storage devices (SSDs). These devices must carefully shuffle data around to spread erase cycles across the media; otherwise, the device will almost certainly fail prematurely. If an SSD knows which blocks the system actually cares about, it can avoid copying the others and make the best use of each erase cycle.
A related technology is "thin provisioning," where a storage array claims to be much larger than it really is. When the installed storage fills, the device can gently suggest that the operator install more drives, conveniently available from the array's vendor. In the absence of knowledge about disused blocks, the array must assume that every block that has ever been written to contains useful data. That approach may sell more drives in the short term, but vendors who want their customers to be happy in the long term might want to be a bit smarter about space management.
Regardless of the type of any specific device, it cannot know about uninteresting blocks unless the operating system tells it. The ATA and SCSI standard committees have duly specified operations for communicating this formation; those operations are often called "trim" or "discard" at the operating system level. Linux has had support for trim operations for some time in the block layer; a few filesystems (and the swap code) have also been modified to send down trim commands when space is freed up. So Linux should be in good shape when it comes to trim support.
The only problem is that on-the-fly trim (also called "online discard") doesn't work that well. On some devices, it slows operation considerably; there's also been some claims that excessive trimming can, itself, shorten drive life. The fact that the SATA version of trim is a non-queued operation (so all other I/O must be stopped before a trim can be sent to the drive) is also extremely unhelpful. The observed problems have been so widespread that SCSI maintainer James Bottomley was recently heard to say:
The alternative is "batch discard," where a trim operation is used to mark large chunks of the device unused in a single operation. Batch discard operations could be run from the filesystem code; they could also run periodically from user space. Using batch discard to run trim on every free space extent would be a logical thing to do after an fsck run as well. Batching discard operations implies that the drive does not know immediately when space becomes unused, but it should be a more performance- and drive-friendly way to do things.
The 2.6.37 includes a new ioctl() command called FITRIM which is intended for batch discard operations. The parameter to FITRIM is a structure describing the region to be marked:
struct fstrim_range {
uint64_t start;
uint64_t len;
uint64_t minlen;
};
An ioctl(FITRIM) call will instruct the filesystem that the free space between start and start+len-1 (in bytes) should be marked as unused. Any extent less than minlen bytes will be ignored in this process. The operation can be run over the entire device by setting start to zero and len to ULLONG_MAX. It's worth repeating that this command is implemented by the filesystem, so only the space known by the filesystem to be free will actually be trimmed. In 2.6.37, it appears that only ext4 will have FITRIM support, but other filesystems will certainly get that support in time.
Batch discard using FITRIM should address the problems seen with online discard - it can be applied to large chunks of space, at a time which is convenient for users of the system. So it may be tempting to just give up on online discard. But Chris Mason cautions against doing that:
So the kernel developers will probably not trim online discard support at
this time. No filesystem enables it by default, though, and that seems
unlikely to change. But if, at some future time, implementations of the
trim operation improve, Linux should be ready to use them.
| Index entries for this article | |
|---|---|
| Kernel | Block layer/Discard operations |
| Kernel | Solid-state storage devices |
The best way to throw blocks away
Posted Dec 2, 2010 13:23 UTC (Thu) by etienne (guest, #25256) [Link]
The best way to throw blocks away
Posted Dec 2, 2010 14:21 UTC (Thu) by hamjudo (guest, #363) [Link]
It may not make sense for real hardware yet, but it may be a win for virtual hardware. Most noticeably in application software testing, where there is a virtual machine for each of many different test environments. Moving the system images around and uncompressing them, may take longer than the actual tests. If blocks are "discarded" by being zero filled in the virtual environment, they will compress much better and on expansion, they will show up as a sparse file. /dev/zero is the faster than any real disk. We should use it whenever possible. And as etienne pointed out, /dev/zero doesn't leak information.
The best way to throw blocks away
Posted Dec 2, 2010 15:02 UTC (Thu) by ricwheeler (subscriber, #4980) [Link]
For real, physical devices, they have to support the relevant command in their firmware (TRIM for ATA, or WRITE_SAME with UNMAP or UNMAP for SCSI). So far, we have seen that TRIM support has been enabled in many S-ATA SSD's and in a few SCSI based arrays (with others coming). As far as I know, no traditional, single spindle drives implement this.
We definitely could implement a software only layer that is "discard" aware (the device mapper I think is looking at doing this).
I would say that we definitely need both online discard and batched discard. Some devices will really benefit from the online discard (and have no problems with it), others might only do well with batched and some will do best with a both :)
The best way to throw blocks away
Posted Dec 11, 2010 16:44 UTC (Sat) by Lennie (subscriber, #49641) [Link]
It doesn't seem that complicated for me, so maybe this is something Chris Mason thinks as well.
The best way to throw blocks away
Posted Dec 6, 2010 4:41 UTC (Mon) by dougg (subscriber, #1894) [Link]
The best way to throw blocks away
Posted Dec 7, 2010 0:31 UTC (Tue) by tack (guest, #12542) [Link]
The best way to throw blocks away
Posted Dec 8, 2010 16:24 UTC (Wed) by Aissen (guest, #59976) [Link]
the kernel developers will probably not trim online discard support at this timeNice one, editor :-)
Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the
Creative
Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds