
The best way to throw blocks away

By Jonathan Corbet
December 1, 2010
An old-style rotating disk drive does not really care if any specific block contains useful data or not. Every block sits in its assigned spot (in a logical sense, at least), and the knowledge that the operating system does not care about the contents of any particular block is not something the drive can act upon in any way. More recent storage devices are different, though; they can - in theory, at least - optimize their behavior if they know which blocks are actually worth hanging onto. Linux has a couple of mechanisms for communicating that knowledge to the block layer - one added for 2.6.37 - but it's still not clear which of those is best.

So when might a block device want to know about blocks that the host system no longer cares about? The answer is: just about any time that there is a significant mapping layer between the host's view of the device and the true underlying medium. One example is solid-state storage devices (SSDs). These devices must carefully shuffle data around to spread erase cycles across the media; otherwise, the device will almost certainly fail prematurely. If an SSD knows which blocks the system actually cares about, it can avoid copying the others and make the best use of each erase cycle.

A related technology is "thin provisioning," where a storage array claims to be much larger than it really is. When the installed storage fills, the device can gently suggest that the operator install more drives, conveniently available from the array's vendor. In the absence of knowledge about disused blocks, the array must assume that every block that has ever been written to contains useful data. That approach may sell more drives in the short term, but vendors who want their customers to be happy in the long term might want to be a bit smarter about space management.

Regardless of the device type, a device cannot know about uninteresting blocks unless the operating system tells it. The ATA and SCSI standards committees have duly specified operations for communicating this information; those operations are often called "trim" or "discard" at the operating system level. Linux has had support for trim operations in the block layer for some time; a few filesystems (and the swap code) have also been modified to send down trim commands when space is freed up. So Linux should be in good shape when it comes to trim support.

The only problem is that on-the-fly trim (also called "online discard") doesn't work that well. On some devices, it slows operation considerably; there have also been claims that excessive trimming can, itself, shorten drive life. The fact that the SATA version of trim is a non-queued operation (so all other I/O must be stopped before a trim can be sent to the drive) is also extremely unhelpful. The observed problems have been so widespread that SCSI maintainer James Bottomley was recently heard to say:

However, I think it's time to question whether we actually still want to allow online discard at all. Most of the benchmarks show it to be a net lose to almost everything (either SSD or Thinly Provisioned arrays), so it's become an "enable this to degrade performance" option with no upside.

The alternative is "batch discard," where a trim operation is used to mark large chunks of the device unused in a single operation. Batch discard operations could be run from the filesystem code; they could also run periodically from user space. Using batch discard to run trim on every free space extent would be a logical thing to do after an fsck run as well. Batching discard operations implies that the drive does not know immediately when space becomes unused, but it should be a more performance- and drive-friendly way to do things.

The 2.6.37 kernel includes a new ioctl() command called FITRIM, which is intended for batch discard operations. The parameter to FITRIM is a structure describing the region to be marked:

    struct fstrim_range {
        uint64_t start;
        uint64_t len;
        uint64_t minlen;
    };

An ioctl(FITRIM) call will instruct the filesystem that the free space between start and start+len-1 (in bytes) should be marked as unused. Any extent less than minlen bytes will be ignored in this process. The operation can be run over the entire device by setting start to zero and len to ULLONG_MAX. It's worth repeating that this command is implemented by the filesystem, so only the space known by the filesystem to be free will actually be trimmed. In 2.6.37, it appears that only ext4 will have FITRIM support, but other filesystems will certainly get that support in time.
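As a rough illustration (this is not code from the kernel tree or from any existing tool; the command-line mount point argument and the minimal error handling are purely for the sake of the example), a small user-space program performing a batch discard of a whole filesystem might look something like this, assuming the FITRIM definitions exported in <linux/fs.h> as of 2.6.37:

    /*
     * Minimal batch-discard sketch: open a filesystem's mount point and
     * ask the filesystem to trim every free extent it knows about.
     * FITRIM and struct fstrim_range come from <linux/fs.h> in 2.6.37.
     */
    #include <stdio.h>
    #include <limits.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char *argv[])
    {
        struct fstrim_range range = {
            .start  = 0,
            .len    = ULLONG_MAX,   /* cover the entire filesystem */
            .minlen = 0,            /* no minimum extent size */
        };
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, FITRIM, &range) < 0) {
            perror("FITRIM");
            close(fd);
            return 1;
        }
        close(fd);
        printf("trim request completed\n");
        return 0;
    }

On success, the filesystem may write the number of bytes it actually trimmed back into the len field; a real tool would probably want to report that to the user.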

Batch discard using FITRIM should address the problems seen with online discard - it can be applied to large chunks of space, at a time which is convenient for users of the system. So it may be tempting to just give up on online discard. But Chris Mason cautions against doing that:

At any rate, I definitely think both the online trim and the FITRIM have their uses. One thing that has burnt us in the past is coding too much for the performance of the current crop of ssds when the next crop ends up making our optimizations useless. This is the main reason I think the online trim is going to be better and better.

So the kernel developers will probably not trim online discard support at this time. No filesystem enables it by default, though, and that seems unlikely to change. But if, at some future time, implementations of the trim operation improve, Linux should be ready to use them.



The best way to throw blocks away

Posted Dec 2, 2010 13:23 UTC (Thu) by etienne (subscriber, #25256) [Link]

Maybe TRIM could also be used on rotating HDs to clear all unused space to prevent leaking information to "disk-scanning tools" - to be used for instance at power-down time.
Obviously it would be quicker not to surf bad sites at work...

The best way to throw blocks away

Posted Dec 2, 2010 14:21 UTC (Thu) by hamjudo (subscriber, #363) [Link]

It may not make sense for real hardware yet, but it may be a win for virtual hardware, most noticeably in application software testing, where there is a virtual machine for each of many different test environments. Moving the system images around and uncompressing them may take longer than the actual tests. If blocks are "discarded" by being zero-filled in the virtual environment, they will compress much better and, on expansion, they will show up as sparse files. /dev/zero is faster than any real disk; we should use it whenever possible. And as etienne pointed out, /dev/zero doesn't leak information.

The best way to throw blocks away

Posted Dec 2, 2010 15:02 UTC (Thu) by ricwheeler (subscriber, #4980) [Link]

Keep in mind that discard is our high-level mechanism, used by the file system layer to inform the IO stack about what is and is not in use.

Real, physical devices have to support the relevant command in their firmware (TRIM for ATA, or WRITE_SAME with UNMAP or UNMAP for SCSI). So far, we have seen TRIM support enabled in many S-ATA SSDs and in a few SCSI-based arrays (with others coming). As far as I know, no traditional, single-spindle drives implement this.

We definitely could implement a software-only layer that is "discard" aware (the device mapper, I think, is looking at doing this).

I would say that we definitely need both online discard and batched discard. Some devices will really benefit from online discard (and have no problems with it), others might only do well with batched discard, and some will do best with both :)

The best way to throw blocks away

Posted Dec 11, 2010 16:44 UTC (Sat) by Lennie (subscriber, #49641) [Link]

I think the way for SSDs to implement TRIM is to accept the commands immediately, so they don't block, then buffer them and start to clean up when they 'feel the time is right'. I think SSDs could be smart enough to do this.

It doesn't seem that complicated to me, so maybe this is something Chris Mason thinks as well.

The best way to throw blocks away

Posted Dec 6, 2010 4:41 UTC (Mon) by dougg (subscriber, #1894) [Link]

For some reason t10.org recently renamed "thin provisioning" to "logical block provisioning". Maybe "thin" didn't sound technical enough.

The best way to throw blocks away

Posted Dec 7, 2010 0:31 UTC (Tue) by tack (subscriber, #12542) [Link]

I think all this will continue to be of limited value until it works with MD and LVM in the IO stack. (As far as I know, it currently doesn't.)

The best way to throw blocks away

Posted Dec 8, 2010 16:24 UTC (Wed) by Aissen (subscriber, #59976) [Link]

the kernel developers will probably not trim online discard support at this time
Nice one, editor :-)

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds