Block layer discard requests

By Jonathan Corbet
August 12, 2008
Solid-state, flash-based storage devices are getting larger and cheaper, to the point that they are starting to displace rotating disks in an increasing number of systems. While flash requires less power, makes less noise, and is faster (for random reads, at least), it has some peculiar quirks of its own. One of those is the need for wear leveling - trying to keep the number of erase/write cycles on each block about the same to avoid wearing out the device prematurely.

Wear leveling forces the creation of an indirection layer mapping logical block numbers (as seen by the computer) to physical blocks on the media. Sometimes this mapping is done in a translation layer within the flash device itself; it can also be done within the kernel (in the UBI layer, for example) if the kernel has direct access to the flash array. Either way, this remapping comes into play anytime a block is written to the device; when that happens, a new block is chosen from a list of free blocks and the data is written there. The block which previously contained the data is then added to the free list.

If the device fills up with data, that list of free blocks can get quite short, making it difficult to deal with writes and compromising the wear leveling algorithm. This problem is compounded by the fact that the low-level device does not really know which blocks contain useful data. You may have deleted the several hundred pieces of spam backscatter from your mailbox this morning, but the flash mapping layer has no way of knowing that, so it carefully preserves that data while scrambling for free blocks to accommodate today's backscatter. It would be nice if the filesystem layer, which knows when the contents of files are no longer wanted, could communicate this information to the storage layer.

At the lower levels, groups like the T13 committee (which manages the ATA standards) have created protocol extensions to allow the host computer to indicate that certain sectors are no longer in use; T13 calls its new command "trim." Upon receipt of a trim command, an ATA device can immediately add the indicated sectors to its free list, discarding any data stored there. Filesystems, in turn, can cause these commands to be issued whenever a file is deleted (or truncated). That will allow the storage device to make full use of the space which is truly free, making the whole thing work better.
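The free-list bookkeeping described above can be sketched as a toy user-space model. This is an illustration only, not code from any real translation layer: the `toy_ftl` structure, the `ftl_*` functions, and the tiny block counts are all hypothetical, and the sketch assumes a block is erased at the moment it returns to the free list.

```c
#include <assert.h>

#define NUM_PHYS    8     /* physical blocks on the toy device */
#define NUM_LOGICAL 6     /* logical blocks shown to the host */
#define UNMAPPED    (-1)

struct toy_ftl {
    int map[NUM_LOGICAL];          /* logical -> physical, or UNMAPPED */
    int free_list[NUM_PHYS];       /* ring buffer of free physical blocks */
    int free_head;
    int free_count;
    unsigned erase_count[NUM_PHYS];
};

static void ftl_init(struct toy_ftl *f)
{
    for (int l = 0; l < NUM_LOGICAL; l++)
        f->map[l] = UNMAPPED;
    for (int p = 0; p < NUM_PHYS; p++) {
        f->free_list[p] = p;
        f->erase_count[p] = 0;
    }
    f->free_head = 0;
    f->free_count = NUM_PHYS;
}

/* Return a physical block to the tail of the free list (assumed erased here). */
static void ftl_release(struct toy_ftl *f, int phys)
{
    assert(f->free_count < NUM_PHYS);
    f->free_list[(f->free_head + f->free_count) % NUM_PHYS] = phys;
    f->free_count++;
    f->erase_count[phys]++;
}

/* Write: pick a fresh block from the free list, then retire the old one.
 * Returns the physical block used, or -1 if no free block is left. */
static int ftl_write(struct toy_ftl *f, int logical)
{
    if (f->free_count == 0)
        return -1;
    int phys = f->free_list[f->free_head];
    f->free_head = (f->free_head + 1) % NUM_PHYS;
    f->free_count--;

    int old = f->map[logical];
    f->map[logical] = phys;
    if (old != UNMAPPED)
        ftl_release(f, old);
    return phys;
}

/* Trim: the host says the logical block holds nothing useful any more,
 * so its physical block can go straight back to the free list. */
static void ftl_trim(struct toy_ftl *f, int logical)
{
    int old = f->map[logical];
    if (old == UNMAPPED)
        return;
    f->map[logical] = UNMAPPED;
    ftl_release(f, old);
}
```

Without `ftl_trim()`, the only way a block ever returns to the free list in this model is by being overwritten, which is the situation the article describes for a device that fills up with deleted-but-still-mapped data.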

What Linux lacks now, though, is the ability for filesystems to tell low-level block drivers about unneeded sectors. David Woodhouse has posted a proposal to fill that gap in the form of the discard requests patch set. As one might expect, the patches are relatively simple - there's not much to communicate - though some subtleties remain.

At the block layer, there is a new request function which can be called by filesystems:

    int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                             unsigned nr_sects, bio_end_io_t end_io);

This call will enqueue a request to bdev, saying that nr_sects sectors starting at the given sector are no longer needed and can be discarded. If the low-level block driver is unable to handle discard requests, -EOPNOTSUPP will be returned. Otherwise, the request goes onto the queue, and the end_io() function will be called when the discard request completes. Most of the time, though, the filesystem will not really care about completion - it's just passing advice to the driver, after all - so end_io() can be NULL and the right thing will happen.

At the driver level, a new function to set up discard requests must be provided:

    typedef int (prepare_discard_fn) (struct request_queue *queue, 
                                      struct request *req);

    void blk_queue_set_discard(struct request_queue *queue, 
                               prepare_discard_fn *dfn);

To support discard requests, the driver should use blk_queue_set_discard() to register its prepare_discard_fn(). That function, in turn, will be called whenever a discard request is enqueued; it should do whatever setup work is needed to execute this request when it gets to the head of the queue.
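As a rough illustration of how the two halves fit together, here is a user-space mock. It reuses the function names and prototypes quoted above, but everything else is an assumption made for the sketch: the structure layouts are toy stand-ins rather than the kernel's, `toy_prepare_discard()` is a hypothetical driver hook, the fixed-size `pending` array stands in for the real request queue, and completion (and thus `end_io()`) is not modelled at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef unsigned long long sector_t;

struct request {                      /* toy: only what the sketch needs */
    sector_t sector;
    unsigned nr_sects;
    int prepared;
};

struct request_queue;
typedef int (prepare_discard_fn)(struct request_queue *queue,
                                 struct request *req);

#define TOY_MAX_PENDING 16

struct request_queue {                /* toy layout, not the kernel's */
    prepare_discard_fn *prepare_discard;   /* NULL: no discard support */
    struct request pending[TOY_MAX_PENDING];
    unsigned nr_pending;
};

struct block_device {                 /* toy layout */
    struct request_queue *queue;
};

struct bio;                           /* opaque here */
typedef void (bio_end_io_t)(struct bio *bio, int error);

/* Driver side: register the hook that prepares discard requests. */
void blk_queue_set_discard(struct request_queue *queue,
                           prepare_discard_fn *dfn)
{
    queue->prepare_discard = dfn;
}

/* Filesystem side: queue a discard, or fail if the driver cannot do it. */
int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                         unsigned nr_sects, bio_end_io_t end_io)
{
    struct request_queue *q = bdev->queue;
    struct request *req;
    int ret;

    (void)end_io;                     /* completion is not modelled */

    if (!q->prepare_discard)
        return -EOPNOTSUPP;
    if (q->nr_pending == TOY_MAX_PENDING)
        return -ENOMEM;

    req = &q->pending[q->nr_pending];
    req->sector = sector;
    req->nr_sects = nr_sects;
    req->prepared = 0;

    /* The driver's hook runs as the request is enqueued. */
    ret = q->prepare_discard(q, req);
    if (ret)
        return ret;
    q->nr_pending++;
    return 0;
}

/* Hypothetical driver hook: a real one would build the device command. */
int toy_prepare_discard(struct request_queue *queue, struct request *req)
{
    (void)queue;
    req->prepared = 1;
    return 0;
}
```

In this mock, a filesystem that gets `-EOPNOTSUPP` back can simply carry on, since the discard is advice rather than something it depends on.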

Since discard requests go through the queue with all other block requests, they can be manipulated by the I/O scheduler code. In particular, they can be merged, reducing the total number of requests and, perhaps, pulling together enough sectors to free a full erase block. There is a danger here, though: the filesystem may well discard a set of sectors, then write new data to them once they are allocated to a new file. It would be a serious mistake to reorder the new writes ahead of the discard operation, causing the newly-written data to be lost. So discard operations will need to function as a sort of I/O barrier, preventing the reordering of writes before and after the discard. There may be an option to drop the barrier behavior, though, for filesystems which are able to perform their own request ordering.
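One simple way to get merging without risking that kind of reordering is to merge a discard only into the request at the very tail of the queue. The sketch below is a toy model of that idea, not the kernel's I/O scheduler or the patch set's actual merging logic: `toy_queue`, `toy_req`, and `toy_enqueue()` are hypothetical names, and requests are never moved once queued.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_WRITE, TOY_DISCARD };

struct toy_req {
    enum toy_op op;
    unsigned long long sector;   /* first sector of the range */
    unsigned nr_sects;           /* length of the range in sectors */
};

#define TOY_QUEUE_MAX 32

struct toy_queue {
    struct toy_req reqs[TOY_QUEUE_MAX];
    size_t count;
};

/* Append a request. A discard that directly follows, and is contiguous
 * with, the discard at the tail of the queue is folded into it. Nothing
 * is ever moved past an earlier request, so a write queued after a
 * discard stays after it. Returns 0 on success, -1 if the queue is full. */
static int toy_enqueue(struct toy_queue *q, enum toy_op op,
                       unsigned long long sector, unsigned nr_sects)
{
    if (op == TOY_DISCARD && q->count > 0) {
        struct toy_req *tail = &q->reqs[q->count - 1];

        if (tail->op == TOY_DISCARD &&
            tail->sector + tail->nr_sects == sector) {
            tail->nr_sects += nr_sects;      /* back-merge */
            return 0;
        }
    }
    if (q->count == TOY_QUEUE_MAX)
        return -1;
    q->reqs[q->count].op = op;
    q->reqs[q->count].sector = sector;
    q->reqs[q->count].nr_sects = nr_sects;
    q->count++;
    return 0;
}
```

Queuing discards for sectors 0-7 and 8-15 back to back yields a single 16-sector discard, while a discard queued after an intervening write stays a separate request behind that write.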

Outside of filesystems, there may occasionally be a need for other programs to be able to issue discard requests; David's example is mkfs, which could discard the entire contents of the device before making a new filesystem. For these applications, there is a new ioctl() call (BLKDISCARD) which creates a discard request. Needless to say, applications using this feature should be rare and very carefully written.
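For illustration, here is a minimal sketch of a user-space caller. It assumes the BLKDISCARD interface as found in mainline `<linux/fs.h>`, where the argument is a pair of 64-bit values giving a starting byte offset and a byte length; the interface in the patch set under discussion may differ. The helper names and the 512-byte sector size are assumptions of the sketch, and the code is Linux-only. Running `discard_sectors()` against a real device throws away the data in that range.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>            /* BLKDISCARD */

#define TOY_SECTOR_SIZE 512ULL   /* assumed sector size for the sketch */

/* Convert a sector range into the {byte offset, byte length} pair
 * that BLKDISCARD takes. */
void discard_range_bytes(uint64_t first_sector, uint64_t nr_sects,
                         uint64_t range[2])
{
    range[0] = first_sector * TOY_SECTOR_SIZE;
    range[1] = nr_sects * TOY_SECTOR_SIZE;
}

/* Discard nr_sects sectors starting at first_sector on the block device
 * at dev_path. DESTROYS DATA in that range.
 * Returns 0 on success, -1 with errno set on failure. */
int discard_sectors(const char *dev_path, uint64_t first_sector,
                    uint64_t nr_sects)
{
    uint64_t range[2];
    int fd, ret, saved_errno;

    fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;

    discard_range_bytes(first_sector, nr_sects, range);
    ret = ioctl(fd, BLKDISCARD, &range);

    saved_errno = errno;         /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return ret;
}
```

A mkfs-style tool would call something like `discard_sectors()` once for the whole device before laying down new metadata, and would treat a failure as harmless, since a device without discard support still works.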

David's patch includes tweaks for a number of filesystems, enabling them to issue discard requests when appropriate. Some of the low-level flash drivers have been updated as well. What's missing at this point is a fix to the generic ATA driver; this will be needed to make discard requests work with flash devices using built-in translation layers - which is most of the devices on the market, currently. That should be a relatively small piece of the puzzle, though; chances are good that this patch set will be in shape for inclusion into 2.6.28.


Block layer discard requests

Posted Aug 14, 2008 2:42 UTC (Thu) by willy (subscriber, #9762) [Link]

I've been looking into the ATA piece of this (at Dave's request).  Part of the problem is that
the ATA TRIM command isn't fully specified yet, so we'll need to allocate a temporary command
to test it.  This also means that no existing device on the market can take advantage of it.

Another part of the problem is that the way we handle ATA these days is through the SCSI
layer.  So in order to implement the ATA TRIM command, we first have to implement the SCSI
PUNCH command.  The SCSI PUNCH command is a very complex beast.  It allows you to specify all
kinds of things that ATA (and indeed Linux) don't let you do, and as a result is a pain to
implement.  I'm negotiating with T10 to attempt to get PUNCH simplified, but not having much
luck so far.

It's a SMOP but I have more urgent projects right now.  One of them is ata_ram, which (not
entirely coincidentally) today grew the ability to do lazy allocation of the memory pages it
uses as its backing store.  Once that feature's debugged, supporting TRIM should be feasible,
then I'll have motivation to get back to implementing PUNCH.

Block layer discard requests

Posted Aug 14, 2008 17:27 UTC (Thu) by GreyWizard (guest, #1026) [Link]

This also means that no existing device on the market can take advantage of it.

What about virtual machines? At present once a block is allocated on a dynamic disk the host has no way to know that it can later be discarded, even if almost every file in the guest is deleted. Isn't this exactly the same problem? Changing the software that emulates an ATA device could probably happen more quickly than hardware changes.

Block layer discard requests

Posted Aug 14, 2008 17:42 UTC (Thu) by willy (subscriber, #9762) [Link]

You're quite correct that a virtual machine could make some use of it.  If we have an
interface to the filesystem that allows us to punch out blocks (see eg
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?co...) and use sparse files, we could "overprovision".

The software that emulates an ATA device can probably be changed about as quickly as real ATA
devices -- remember drives are full of firmware too.  The holdup is getting T13 to decide on a
command number and publish ATA8.  Then it's a 'simple matter' of getting users to upgrade ...
and I think a large number of them are going to be wary about further updates from at least
one major virtualisation provider from now on.

Oh, of course drive vendors aren't typically quick to publish microcode updates.  You've
already bought the drive, so they have little to gain from giving you firmware updates, and as
every firmware update is potentially dangerous, they have a certain liability there too.

Block layer discard requests

Posted Aug 14, 2008 19:09 UTC (Thu) by GreyWizard (guest, #1026) [Link]

That makes sense.  Thanks for the extra information.

Block layer discard requests

Posted Aug 14, 2008 5:11 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246) [Link]

Otherwise, the request goes onto the queue, and the end_io() function will be called when the discard request completes. Most of the time, though, the filesystem will not really care about completion - it's just passing advice to the driver, after all - so end_io() can be NULL and the right thing will happen.

Time for my naive questions: Is there ever a race condition where you could free some number of sectors, and then reallocate them, such that the writes for those sectors somehow get ahead of the trim operation? Is there anything that guarantees the ordering?

I ask because I was under the impression that I/O schedulers like CFQ try to balance among processes, and so it's not clear that a truncate operation from process A remains strongly ordered relative to a file written by process B.

Block layer discard requests

Posted Aug 14, 2008 11:22 UTC (Thu) by willy (subscriber, #9762) [Link]

We guarantee the ordering by making discard requests be soft barriers.

Reordering requirements

Posted Aug 15, 2008 9:34 UTC (Fri) by pjm (subscriber, #2080) [Link]

What am I missing?  Jon Corbet's article and seanyoung's implementation and possibly willy's
comment above suggest that these trim/punch/discard/... requests are somehow very special in
their reordering requirements, whereas it seems to me that a discard is just the same as a
write: the existing code mustn't reorder any writes to a given block, and the new code mustn't
reorder any write-or-discard requests to a given block.  This shouldn't require any more
barriers or speed penalty than we already have for writes.

(The only difference is that one might use *looser* requirements for reordering discards and
reads for a given block, depending on the semantics of reading a discarded block.)

Reordering requirements

Posted Aug 15, 2008 14:54 UTC (Fri) by willy (subscriber, #9762) [Link]

This was already discussed in the email thread ... overlapping writes are already serialised
by the page lock.  Discard doesn't have a page to lock, so you can't rely on this.

Block layer discard requests

Posted Aug 14, 2008 10:52 UTC (Thu) by i3839 (guest, #31386) [Link]

Umm, perhaps I'm naive, but I thought (hoped) that wear-leveling already 
re-used existing blocks, at least hardware based ones, or is that too
expensive? It seems the only sane way to have long term guaranteed
reliability, by moving content on little written blocks to elsewhere and
using those blocks for frequent writes too. Sure, writing n blocks means 
you need to read n blocks and write 2n blocks, but all blocks would be 
written the same number of times in the end. This shuffling around doesn't
have to happen all the time though.

Block layer discard requests

Posted Aug 14, 2008 11:24 UTC (Thu) by willy (subscriber, #9762) [Link]

You're correct that wear-levelling reuses existing blocks.  This is about telling the
flash-based device which blocks aren't used any more and hence don't need to be copied.  The
filesystem knows when a block isn't used any more, it just needs a way to tell the flash
device.

Block layer discard requests

Posted Aug 14, 2008 14:38 UTC (Thu) by i3839 (guest, #31386) [Link]

Ah, okay, that makes sense.

Block layer discard requests

Posted Aug 14, 2008 11:43 UTC (Thu) by seanyoung (subscriber, #28711) [Link]

A couple of years ago I did something similar (but not as complete).

http://lwn.net/Articles/162776/

I found that having forget/discard as barrier requests can be very bad for performance. At any
point when a forget/discard is issued, all data is written to flash which would not have been
necessary without the barrier; the dirty blocks could have remained in memory.

All of this can be solved through proper merging though. The rules would become fairly
difficult, I think.

The other issue was that only the in-tree FTL layers could make use of them. CompactFlash ATA
does have an "erase sectors" option but this is not really what you want (pre-erase sectors
such that the next write will succeed without waiting for a flash erase).

Block layer discard requests

Posted Aug 18, 2008 22:57 UTC (Mon) by jlokier (guest, #52227) [Link]

They're not _filesystem_ barrier requests: a DISCARD doesn't cause dirty blocks/pages to be
flushed.  Rather, it's a barrier in the request queue, so any write submitted to the request
queue afterwards will not pass the DISCARD.

Block layer discard requests

Posted Aug 19, 2008 9:39 UTC (Tue) by seanyoung (subscriber, #28711) [Link]

You say writes submitted after the discard will not be merged before the barrier. 

So say we have 100 files being deleted on a FAT filesystem. After each file deletion a
discard (i.e. BARRIER) is submitted. 

Now instead of writing the FAT table changes once at the end, the FAT table changes must be
written 100 times.

Block layer discard requests

Posted Aug 19, 2008 16:47 UTC (Tue) by jlokier (guest, #52227) [Link]

Writes were never merged by the device request queue anyway.

Merging the FAT table writes happens at a higher level: in the FAT filesystem.  That's not
affected by these changes.  The FAT filesystem will submit a series of DISCARD requests for
each deleted file, interspersed with a smaller number of FAT write requests which merge
multiple changes.

... and Flash ERASE commands

Posted Aug 24, 2008 4:48 UTC (Sun) by HalfMoon (guest, #3211) [Link]

CompactFlash ATA does have an "erase sectors" option but this is not really what you want (pre-erase sectors such that the next write will succeed without waiting for a flash erase).

I've seen a fair number of "raw flash" chips with that same semantic: writes to erased segments are faster. By a factor of about five, so it's well worth leveraging. In some cases that implies using different write procedures though ... so just having a "trim" hook isn't enough, the lowlevel code would need to know whether the area being written was already erased/trimmed.

Block layer discard requests

Posted Aug 14, 2008 15:58 UTC (Thu) by bronson (subscriber, #4806) [Link]

This sounds like a most needed addition.  I'll be watching its development.

Quick question: why the terminology change?  If T13 calls it "trim", why should Linux call it
"discard"?

Block layer discard requests

Posted Aug 14, 2008 17:34 UTC (Thu) by willy (subscriber, #9762) [Link]

SCSI calls it PUNCH.  I dread to think what the Flash people call it.  Names change in
different standards.

Block layer discard requests

Posted Aug 15, 2008 11:39 UTC (Fri) by dougg (subscriber, #1894) [Link]

Poking around the latest ATA8-ACS and SCSI drafts I can find no sign of the ATA TRIM command
(in D1699r6-ATA8-ACS.pdf). Willy, could you give a document reference?

As for the SCSI PUNCH command that is in the Object Storage Devices (OSD-2) command set. No
wonder it is complex with a 236 byte cdb! Is OSD appropriate for flash devices? [Sledge hammer
for an acorn.] Surely a new SCSI command is needed in SBC-3 (SCSI Block Commands as used by
disks today) and the command name "TRIM" hasn't been used yet.

Also, there are the SCSI to ATA Translation (SAT) PASS-THROUGH commands that allow, for
example, an ATA TRIM command to be tunnelled through a SCSI command layer.

Block layer discard requests

Posted Aug 15, 2008 14:52 UTC (Fri) by willy (subscriber, #9762) [Link]

The proposals I'm looking at are
http://www.t13.org/Documents/UploadedDocuments/docs2008/e... and http://www.t10.org/ftp/t10/document.08/08-149r0.pdf
I think this PUNCH is different from the PUNCH you're looking at in OSD2.

Block layer discard requests

Posted Aug 15, 2008 7:09 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

this is an interesting feature. If Linux implements this well I could see devices showing up
in a few years that internally do a secure wipe when the block is freed.

or a ram-based device that shuffles used sections around to be able to power off unused areas.
(or for that matter, shuffle things on a raid 10 array to power off unneeded drives)

and with the scsi target patches that are out there giving you the ability to use a full linux
box as a drive for other systems people wouldn't have to wait for the drive manufacturers to
develop all of this.

this could get _very_ interesting

Block layer discard requests

Posted Aug 15, 2008 17:17 UTC (Fri) by dwmw2 (subscriber, #2063) [Link]

Actually I dropped the end_io argument to blkdev_issue_discard() — it was just too much of a pain to use. And I also implemented a sb_issue_discard() function, which uses the block size from sb->s_blocksize rather than a hard-coded 512-byte sector size.

Block layer discard requests

Posted Aug 15, 2008 23:09 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

there is a new ioctl() call (BLKDISCARD) which creates a discard request.

This is the oft-discussed file punch/clear/freespace system call -- on a block device special file. Why doesn't linux have this yet? Other OSes do. This new ioctl class seems to be actively avoiding it.

Needless to say, applications using this feature should be rare and very carefully written.

No, I don't see it. These would be approximately the same applications that write to block device special files today and their use of the discard ioctl would be no more careful than their use of write().

Block layer discard requests

Posted Aug 16, 2008 9:33 UTC (Sat) by dwmw2 (subscriber, #2063) [Link]

This is the oft-discussed file punch/clear/freespace system call -- on a block device special file. Why doesn't linux have this yet? Other OSes do. This new ioctl class seems to be actively avoiding it.

Actually, Linux already has it in one form - madvise(MADV_REMOVE) can do this for inodes with a ->truncate_range() method (which is currently only tmpfs and shmpfs, but I plan to add JFFS2).

I didn't actively avoid it — I did take a look at what it would take to hook up something similar, but it was decidedly non-trivial. I'll probably come back to it, but it doesn't live in a patch sequence of primarily block-layer modifications.

Block layer discard requests

Posted Aug 16, 2008 23:14 UTC (Sat) by TimMann (subscriber, #37584) [Link]

It's nice to see a primitive like this making it into mainline Linux.  We did something
similar for the Itsy handheld back around 2000-2001.  Itsy had flash on the motherboard with
its own drivers, so we didn't need to wait for anything to be added to the SCSI or ATA
interfaces before the idea could be used.  I don't think reordering of discards vs. writes was
possible with our flash drivers.  See:

http://tim-mann.org/papers/hamburgen01itsy.pdf
http://tim-mann.org/papers/SRC-TN-2001-001.pdf

Even earlier than that, the Petal distributed virtual disk had a discard primitive and the
Frangipani filesystem built on top of it used it.  Conventional filesystems could run on Petal
too, but would use disk space more efficiently if modified to use the discard primitive.

http://tim-mann.org/papers/frangipani.pdf
http://portal.acm.org/citation.cfm?id=237157

Basically, this primitive is good for most kinds of virtual disk, to allow savings of one kind
or another if the virtual disk is informed that it can throw some data away.  I was always
kind of surprised it didn't make it into standard OSes long ago.



Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds