LWN: Comments on "Issues around discard" https://lwn.net/Articles/787272/
This is a special feed containing comments posted to the individual LWN article titled "Issues around discard".

Issues around discard https://lwn.net/Articles/788600/ zdzichu <div class="FormattedComment"> For firmware, I think the Linux Vendor Firmware Service (<a href="https://fwupd.org/">https://fwupd.org/</a>) can be used to update SSD firmware under Linux.<br> </div> Thu, 16 May 2019 05:40:33 +0000

Issues around discard https://lwn.net/Articles/788598/ scientes <div class="FormattedComment"> The problem is that the firmware is non-free software. I've had SSDs that just fail with TRIM, for example, and you are given a Windows-only update tool with a binary-blob firmware. I just threw it out and decided never to buy from that company again.<br> <p> These firmwares run on ARM M-profile cores, and they should be free software.<br> <p> There is a project in this direction: <a rel="nofollow" href="http://www.openssd.io/">http://www.openssd.io/</a><br> </div> Thu, 16 May 2019 04:05:07 +0000

Issues around discard https://lwn.net/Articles/788322/ roblucid <div class="FormattedComment"> Maybe the kernel needs to give user space the ability to pass it some hints: something like a drivecap file plus a utility that then configures the kernel's policy for the drive's characteristics? This is how BSD used to tune for HDDs, when sectors per cylinder mattered (way before HDDs switched to LBA addressing).<br> <p> Then SSD vendors (or large customers) could characterise how their drive is expected to be used: LBA reuse, discard penalties, phantom discards, and the like. You might even be able to tune for service life, with user space logging expected degraded performance.<br> </div> Tue, 14 May 2019 08:57:04 +0000

We should not look at discard as a uniform feature in the first place https://lwn.net/Articles/788313/ hmh <div class="FormattedComment"> It entirely depends on the competence of the vendor that wrote the device firmware, and on whoever designed the queue protocol not being crazy enough to forget about write collisions between queues.<br> <p> A discard really is just a write as far as ordering and races/collisions go.<br> </div> Tue, 14 May 2019 02:23:25 +0000

We should not look at discard as a uniform feature in the first place https://lwn.net/Articles/788204/ Fowl <div class="FormattedComment"> What happens if a write and a discard race?<br> </div> Mon, 13 May 2019 12:39:01 +0000

There are two very different types of TRIM command https://lwn.net/Articles/788038/ GoodMirek <div class="FormattedComment"> In my eyes, there is a significant difference between write and discard.<br> If a write fails, it can cause data loss, which is a critical issue and therefore requires transaction safety.<br> If a discard fails, the worst that can happen is some performance and wear deterioration, which is a negligible issue.<br> </div> Fri, 10 May 2019 13:20:28 +0000

it is the O_PONIES issue again! https://lwn.net/Articles/788008/ miquels I have several SSDs in production that have written not 700 TB, but 7600 TB. In 4 years' time. Datacentre SSDs FTW :)
<pre>
=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 845DC PRO 800GB
User Capacity:    800,166,076,416 bytes [800 GB]
Sector Size:      512 bytes logical/physical

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK  099   099   010    -    3
  9 Power_On_Hours          -O--CK  093   093   000    -    34281
 12 Power_Cycle_Count       -O--CK  099   099   000    -    4
177 Wear_Leveling_Count     PO--C-  076   076   005    -    9158
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-  099   099   010    -    3
180 Unused_Rsvd_Blk_Cnt_Tot PO--C-  099   099   010    -    7037
241 Total_LBAs_Written      -O--CK  094   094   000    -    16410885339592
242 Total_LBAs_Read         -O--CK  097   097   000    -    7734700749043
250 Read_Error_Retry_Rate   -O--CK  100   100   001    -    0
</pre> Fri, 10 May 2019 00:01:13 +0000
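As a quick sanity check on the figure above: the Total_LBAs_Written raw value converts to bytes by multiplying by the sector size (assuming, as the "Sector Size" line suggests, that the counter is in 512-byte units), and the "7600 TB" turns out to be TiB:

<pre>
16,410,885,339,592 LBAs × 512 bytes/LBA = 8,402,373,293,871,104 bytes
                                        ≈ 8.4 × 10^15 bytes
                                        ≈ 7,642 TiB  (the "7600 TB" above)
</pre>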
There are two very different types of TRIM command https://lwn.net/Articles/787813/ masoncl <div class="FormattedComment"> "That's why mentioning that there are both blocking and nonblocking TRIMs matters so much, because if non-blocking TRIM is available it effectively works like a write as to queueing too, thus there is very little to be gained from backgrounding TRIMs like XFS does. Apart from love of overcomplexity, which seems rather common in the design of XFS."<br> <p> Actual queueing support for discards does change the math a bit, but the fundamental impact on the latency of other operations is still a problem. Sometimes it's worse, because you're just allowing the device to saturate itself with slow operations.<br> <p> The XFS async trim implementation is pretty reasonable, and it can be a big win in some workloads. Basically anything that gets pushed out of the critical section of the transaction commit can have a huge impact on performance. The major thing it's missing is a way to throttle new deletes from creating a never-ending stream of discards, but I don't think any of the filesystems are doing that yet.<br> </div> Wed, 08 May 2019 15:26:04 +0000

Issues around discard https://lwn.net/Articles/787789/ nilsmeyer <div class="FormattedComment"> There are a lot of websites doing hardware testing already; however, few of them test with Linux (Phoronix and ServeTheHome come to mind). I think they could and should add discard testing; with Phoronix, at least, the procedure is somewhat standardized. Another test I would really like to see is fsync() performance, since that shows you the actual, durable write performance of the drive.<br> </div> Wed, 08 May 2019 06:41:30 +0000
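A minimal sketch of the kind of fsync() latency test described above might look like the following; the file name, 4 KiB write size, and iteration count are arbitrary placeholders, and a serious benchmark would control for caching and use a dedicated tool such as fio instead:

<pre>
/* fsync-bench.c: rough fsync() latency probe (illustrative sketch only).
 * Build: cc -O2 -o fsync-bench fsync-bench.c
 * Usage: ./fsync-bench /path/on/device/testfile   (path is a placeholder)
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <testfile>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 0xab, sizeof(buf));

    enum { ITERS = 1000 };
    struct timespec t0, t1;
    double total_ms = 0, worst_ms = 0;

    for (int i = 0; i < ITERS; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write"); return 1;
        }
        /* Time only the durable-commit part of each write. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        total_ms += ms;
        if (ms > worst_ms) worst_ms = ms;
    }
    printf("fsync: avg %.3f ms, worst %.3f ms over %d 4KiB writes\n",
           total_ms / ITERS, worst_ms, ITERS);
    close(fd);
    return 0;
}
</pre>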
There are two very different types of TRIM command https://lwn.net/Articles/787774/ walex <blockquote><p>“XFS will allow the commit to finish and let the trims continue floating down in the background”</p></blockquote> <p>Indeed, but a discard is pretty much like a write, so it could be handled the same way; why then would XFS do that complicated stuff? The obvious reason is that if discarding is handled synchronously, as Btrfs and <tt>ext4</tt> do, then issuing blocking (non-queued) TRIM can cause long freezes, which is indeed why many people don't use the <tt>discard</tt> mount option but instead run <tt>fstrim</tt> every now and then at quiet times.</p> <p>That's why mentioning that there are both blocking and nonblocking TRIMs matters so much, because if non-blocking TRIM is available it effectively works like a write as to queueing too, thus there is very little to be gained from backgrounding TRIMs like XFS does. Apart from love of overcomplexity, which seems rather common in the design of XFS.</p> <p>Put another way, pretty much the entire TRIM debate has been caused by the predominance of blocking TRIM in the SATA installed base of consumer flash SSDs (the other, minor reason has been the numerous TRIM-related bugs in many models of flash SSDs, which are just part of the numerous bugs of many models of flash SSDs).</p> Tue, 07 May 2019 23:44:04 +0000

it is the O_PONIES issue again! https://lwn.net/Articles/787764/ walex <blockquote><p>«"How the hell did you write SEVEN HUNDRED TERABYTES to this drive in two years‽"</p> <p>It was the Ceph journal drive.»</p> </blockquote> <p>And that is also because the 950 EVO does not have a persistent (supercapacitor-backed) cache, and thus all 700 TB will have hit the flash chips, even if a lot of it probably was just ephemeral. Anyhow, using the 950 EVO as a Ceph journal device, especially at that high rate of journaling (38 GB/hour), probably cost Ceph a lot in latency.</p> Tue, 07 May 2019 21:23:14 +0000

There are two very different types of TRIM command https://lwn.net/Articles/787763/ masoncl <div class="FormattedComment"> I was talking about async discards in a slightly different context. Btrfs and ext4 will block the transaction commit until we've finished trimming the things we deleted during the transaction. XFS will allow the commit to finish and let the trims continue floating down in the background, while making sure not to reuse the blocks until the trim is done.<br> <p> Depending on the device, the async approach can be much faster, but it can also lead to a very large queue of discards, without any way for the application to wait for completion.<br> </div> Tue, 07 May 2019 21:19:42 +0000
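For contrast with the asynchronous discards just described, the fstrim approach mentioned earlier in the thread is synchronous from the caller's point of view: fstrim(8) is essentially a wrapper around the FITRIM ioctl, which returns only once the filesystem has finished issuing discards for the requested range. A minimal sketch; the mount point is a placeholder, and filesystems without trim support return EOPNOTSUPP:

<pre>
/* fitrim.c: ask a mounted filesystem to discard its free space, which is
 * what fstrim(8) does under the hood (illustrative sketch).
 * Build: cc -O2 -o fitrim fitrim.c
 * Usage: sudo ./fitrim /mnt/ssd        (mount point is a placeholder)
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,  /* the whole filesystem */
        .minlen = 0,           /* no minimum extent size */
    };
    /* Blocks until the filesystem has issued discards for all free space. */
    if (ioctl(fd, FITRIM, &range) < 0) { perror("FITRIM"); return 1; }

    /* On return, the kernel updates range.len to the bytes trimmed. */
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    close(fd);
    return 0;
}
</pre>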
it is the O_PONIES issue again! https://lwn.net/Articles/787738/ naptastic <div class="FormattedComment"> I inherited a Samsung 950 Evo after it was retired from ~2 years of service. Once it was installed, I checked the SMART data, and I couldn't believe the "Total LBAs written" number.<br> <p> "How the hell did you write SEVEN HUNDRED TERABYTES to this drive in two years‽"<br> <p> It was the Ceph journal drive.<br> </div> Tue, 07 May 2019 17:11:14 +0000

There are two very different types of TRIM command https://lwn.net/Articles/787736/ walex <blockquote><p>“XFS does discard asynchronously, while ext4 and Btrfs do it synchronously.”</p></blockquote> <p>The discussion throughout the article and here is made less useful by a vital omission: there is no mention that the first edition of the TRIM command for SATA was "blocking" ("synchronous"), but there is now a variant that is non-blocking ("asynchronous").</p> <p>Essentially all the problems reported with 'discard' are due to the use of the first, "blocking" variant, which unfortunately is the only one that has been implemented on most of the SATA flash SSD installed base so far. <a href="https://en.wikipedia.org/wiki/Trim_(computing)#Shortcomings">Wikipedia says</a>: <blockquote><p>“The original version of the TRIM command has been defined as a non-queued command by the T13 subcommittee, and consequently can incur massive execution penalty if used carelessly, e.g., if sent after each filesystem delete command. The non-queued nature of the command requires the driver to first wait for all outstanding commands to be finished, issue the TRIM command, then resume normal commands.”</p></blockquote> <p>SAS/SCSI and NVMe have similar commands with different semantics; I particularly like the "write zeroes" command of NVMe.</p> Tue, 07 May 2019 16:43:19 +0000
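For reference, on Linux these per-transport commands (ATA TRIM, SCSI UNMAP, NVMe Deallocate / Write Zeroes) are all reached through the same block-layer discard path; from user space, blkdiscard(8) exercises it via the BLKDISCARD ioctl, and the kernel issues whichever command the device supports. A minimal, destructive sketch; the device path and the 1 GiB range are placeholders, and this erases data:

<pre>
/* blkdiscard-range.c: discard a byte range on a block device, as
 * blkdiscard(8) does (illustrative sketch; THIS DESTROYS DATA).
 * Build: cc -O2 -o blkdiscard-range blkdiscard-range.c
 * Usage: sudo ./blkdiscard-range /dev/sdX   (device path is a placeholder)
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* {offset, length} in bytes; must be aligned to the logical block size. */
    uint64_t range[2] = { 0, 1ULL << 30 };  /* first 1 GiB, arbitrary example */

    /* The kernel translates this into ATA TRIM, SCSI UNMAP, or NVMe
     * Deallocate, depending on what the device actually speaks. */
    if (ioctl(fd, BLKDISCARD, range) < 0) { perror("BLKDISCARD"); return 1; }

    close(fd);
    return 0;
}
</pre>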
it is the O_PONIES issue again! https://lwn.net/Articles/787735/ walex <blockquote><p>“but the FTL can take an exorbitant amount of time when gigabytes of files are deleted; read and write performance can be affected.”</p></blockquote> <p>This is alluded to in the text by C. Mason and others, but that is typical of devices that don't have a supercapacitor-backed cache/buffer: they must commit every delete to flash.</p> <p>So-called "enterprise" devices have supercapacitor-backed caches, and can do deletes (and random writes) a lot faster. The situation is rather similar to RAID host adapters with a cache, where a BBU makes a huge difference.</p> <p>It is the famous <tt>O_PONIES</tt> and <q>eternal September</q> issue that never goes away, because every year there is a new batch of newbie sysadmins and programmers who don't get persistence and caches, and just want <tt>O_PONIES</tt>.</p> <p>People familiar with using SSDs for journaling in a Ceph storage layer know how enormous a difference a supercapacitor-backed SSD cache makes...</p> Tue, 07 May 2019 16:27:44 +0000

Issues around discard https://lwn.net/Articles/787697/ shentino <div class="FormattedComment"> Discard isn't just for SSDs.<br> <p> In essence, discard is at this point a fundamental storage operation, just like reads and writes.<br> <p> LVM thin pools, for example, use high-level discards as cues to deallocate committed pool space, which may well provoke the thin pool itself to cascade discards to its own storage.<br> <p> It's also used in virtualization.<br> <p> VMs that issue discards to virtual block devices can likewise provoke the hypervisor into deallocating the storage or space occupied by whatever it stores on the host to back the device. A guest OS issuing discards to its virtual drives, even ones presented as "spinning rust", can help a hypervisor optimize how it manages the storage on the host.<br> <p> Discards are a big opportunity for higher layers like these to give lower layers housekeeping opportunities beyond just letting an SSD garbage-collect.<br> <p> They should be sent liberally, at every opportunity. If anything, the overhead of managing them should encourage lower layers to take advantage of the information.<br> </div> Tue, 07 May 2019 13:52:58 +0000

We should not look at discard as a uniform feature in the first place https://lwn.net/Articles/787688/ hmh <div class="FormattedComment"> Sometimes it feels like the real use for "TRIM" ("discard") on flash-based, old-style storage devices with advanced FTLs (i.e. SATA- or SAS-attached SSDs) is being forgotten. It is there to *reduce needless copying of stale pages of data* by the SSD itself, i.e. to reduce the need for background block writes. It is not a speedy way to delete data blocks, or if it is, someone forgot to properly notify the device vendors about it -- it is a flash endurance saver.<br> <p> When you either TRIM or overwrite an LBA, the SSD gets the implied information that the old block is not going to be reused, and it can be scheduled to be *erased*.<br> <p> OTOH, when a filesystem prefers to direct writes to a new LBA and TRIM/discard is never done on the old, now-freed blocks, those old blocks are going to be copied around by the SSD firmware to free up erase blocks (much like memory compaction tries to do to create huge pages). This wastes flash write cycles, increases on-device fragmentation, and reduces the number of "erased and ready to be used" flash pages. It also eventually drives the SSD into the dreaded "slow as an old floppy drive" state.<br> <p> So, what "discard" is really useful for on [old-style non-NVMe?] SSDs is vastly different from why one would use "discard" on, e.g., a thin-provisioned volume. And it is *not* any less important.<br> </div> Tue, 07 May 2019 11:47:11 +0000

Issues around discard https://lwn.net/Articles/787680/ kdave <blockquote>Martin Petersen said. If performance was bad for a device, the maker could recommend mounting without enabling discard; if the kernel developers had simply consistently enabled discard, vendors would have fixed their devices by now.</blockquote> ... <strike>vendors would have fixed their devices</strike> users would simply bug filesystem developers until discard is off by default, with vendors doing nothing. No matter how much I'd like this approach to work, it does not work as expected in practice. I think vendors respond to $$$ and to big companies asking for things, but, for example, see where we are with the erase block size. From our view it is a simple thing, yet there has been no change, AFAIK, with answers ranging from "trade secret" to "you don't need to know". And I'm afraid this won't change. Tue, 07 May 2019 08:09:19 +0000

Issues around discard https://lwn.net/Articles/787620/ fuhchee <div class="FormattedComment"> Did the possibility of running online measurements of TRIM performance come up? The filesystem could learn the actual contemporaneous characteristics of TRIMs of various sizes during lulls in operation.<br> </div> Mon, 06 May 2019 18:01:15 +0000
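As far as I know, nothing like this exists in the kernel today, but a crude user-space approximation of the measurement suggested above could time discards of increasing extent sizes on a scratch device. A sketch under those assumptions; it is destructive, the device path is a placeholder, and since discarding an already-discarded range may be a near no-op, a realistic probe would write each range first:

<pre>
/* discard-probe.c: time discards of increasing sizes to characterize a
 * device (illustrative sketch; DESTROYS DATA on the target device).
 * Build: cc -O2 -o discard-probe discard-probe.c
 * Usage: sudo ./discard-probe /dev/sdX   (scratch device only!)
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <scratch-blockdev>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Probe extent sizes from 1 MiB up to 1 GiB, doubling each time.
     * A real probe would vary the offsets and pre-write the ranges so
     * that the discards are not no-ops. */
    for (uint64_t len = 1ULL << 20; len <= 1ULL << 30; len <<= 1) {
        uint64_t range[2] = { 0, len };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (ioctl(fd, BLKDISCARD, range) < 0) { perror("BLKDISCARD"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%8llu KiB: %.3f ms\n", (unsigned long long)(len >> 10), ms);
    }
    close(fd);
    return 0;
}
</pre>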