
Supporting NFS v4.2 WRITE_SAME

By Jake Edge
June 16, 2025

LSFMM+BPF

At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Anna Schumaker led a discussion about implementing the NFS v4.2 WRITE_SAME command in both the NFS client and server. WRITE_SAME is meant to write large amounts of identical data (e.g. zeroes) to the server without actually needing to transfer all of it over the wire. In her topic proposal, Schumaker wondered whether other filesystems needed the functionality, in which case it should be implemented at the virtual filesystem (VFS) layer, or whether it should simply be handled as an NFS-specific ioctl().

The NFS WRITE_SAME operation was partly inspired by the SCSI WRITE SAME command, she began; it is "intended for databases to be able to initialize a bulk of records all at once". It offloads much of the work to the server side. So far, Schumaker has been implementing WRITE_SAME with an ioctl() using a structure that looks similar to the application data block structure defined in the NFS v4.2 RFC for use by WRITE_SAME.
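The application data block from RFC 7862 describes the write in terms of a block size, a block count, an optional embedded block number, and the pattern itself. As a purely illustrative sketch (the structure name, field layout, and ioctl number below are invented for the example; Schumaker's actual patches are not shown here), an ioctl() argument modeled on it might look like:

    /* Hypothetical ioctl argument modeled on the app_data_block4
     * structure from RFC 7862; names and layout are assumptions
     * for illustration, not Schumaker's actual interface. */
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct nfs_write_same_args {
            uint64_t adb_offset;           /* byte offset of the first block */
            uint64_t adb_block_size;       /* size of each block, in bytes */
            uint64_t adb_block_count;      /* number of blocks to write */
            uint64_t adb_reloff_blocknum;  /* offset of the block number within a block */
            uint32_t adb_block_num;        /* block number of the first block */
            uint64_t adb_reloff_pattern;   /* offset of the pattern within a block */
            uint32_t adb_pattern_len;      /* length of the pattern data */
            uint64_t adb_pattern;          /* user pointer to the pattern bytes */
    };

    /* Hypothetical request number, for the sketch only. */
    #define NFS_IOC_WRITE_SAME _IOW('N', 0x2a, struct nfs_write_same_args)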

[Anna Schumaker]

On the server side, it would make sense to have a function that gets called to process the WRITE_SAME command, but it would be nice if that same function were available to clients; they could use it as a fallback when the server does not implement WRITE_SAME. Other filesystems could potentially also use the functionality, either backed by the SCSI WRITE SAME command or for some other filesystem-specific use case.
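As a minimal sketch of the kind of shared helper being described (not taken from any posted patches; the name and signature are assumptions), expanding the pattern into ordinary writes might look like:

    /* Fallback path: expand a WRITE_SAME request into plain
     * pwrite() calls.  The server could call this to process the
     * command; the client could call it when the server lacks
     * WRITE_SAME support.  Assumes pattern_len evenly divides
     * block_size. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    static int write_same_fallback(int fd, uint64_t offset,
                                   size_t block_size, uint64_t count,
                                   const void *pattern, size_t pattern_len)
    {
            char block[block_size];        /* C99 VLA; fine for a sketch */

            for (size_t i = 0; i < block_size; i += pattern_len)
                    memcpy(block + i, pattern, pattern_len);
            for (uint64_t i = 0; i < count; i++)
                    if (pwrite(fd, block, block_size,
                               offset + i * block_size) != (ssize_t)block_size)
                            return -1;
            return 0;
    }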

The application data block allows for WRITE_SAME commands that write various patterns to the storage, but Christoph Hellwig suggested that all of that complexity should be avoided. He was responsible for writing the WRITE_SAME definition for NFS and for killing off the Linux block-layer support for the SCSI WRITE SAME patterns; "don't do it", he said with a laugh. WRITE_SAME for zeroing is "perfectly fine", and SCSI supports that, but "exposing all the detailed, crazy patterns" is "not sane". Getting the semantics right for all of the different cases is extremely difficult. Schumaker said that sounded reasonable.

There is already an API available for clients to use, Amir Goldstein said: fallocate() with the FALLOC_FL_ZERO_RANGE flag. Schumaker said that NFS did not have support for that flag, but Goldstein said that support could be added as the way to provide access to WRITE_SAME. Hellwig said that there was a patch set that he had not yet looked at closely to add an FALLOC_FL_WRITE_ZEROES flag that would force the zeroes to be written; it might be a better API for WRITE_SAME. That series is now on v5 and seems to be progressing toward inclusion.
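For reference, the existing flag is already usable from user space against filesystems that support it; a minimal example follows. (FALLOC_FL_WRITE_ZEROES exists only in the proposed series, so the example sticks to FALLOC_FL_ZERO_RANGE.)

    /* Make the first 1GiB of a file read back as zeroes using the
     * fallocate() flag Goldstein pointed to.  Whether NFS can map
     * this onto WRITE_SAME is exactly what the session discussed. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }
            int fd = open(argv[1], O_WRONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 1ULL << 30) < 0) {
                    perror("fallocate(FALLOC_FL_ZERO_RANGE)");
                    return 1;
            }
            return 0;
    }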

Matthew Wilcox wondered whether only being able to write zeroes would make the WRITE_SAME feature less than entirely useful; he remembered "a certain amount of pushback because databases need a specific pattern". There was a fair amount of joking about which of the two Oracle databases (the other being MySQL) he meant; Wilcox works for Oracle, as does Schumaker, who seemed to indicate that she had talked to the MySQL group. In the end, someone seemed to sum up that only supporting zeroing is reasonable: "zeroes are good".

Chuck Lever, who also works for Oracle, said that he had spoken to the Oracle database group. That database does not use the Linux NFS client, so the group did not care about support for WRITE_SAME in the client. The group's concern was mostly about support for WRITE_SAME in proprietary NFS servers, he said. Wilcox asked: "and Linux NFS servers?" Lever said that Oracle databases do not deploy on systems that use those.


Index entries for this article
Kernel: Filesystems/NFS
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



Reducing data over the wire

Posted Jun 16, 2025 15:50 UTC (Mon) by epa (subscriber, #39769) [Link] (13 responses)

> WRITE_SAME is meant to write large amounts of identical data (e.g. zeroes) to the server without actually needing to transfer all of it over the wire.

If that's really the concern, then shouldn't NFS support optional compression with deflate (gzip) or some equally boring compression method? Not all requests would have to be compressed, but if you have one of these "write gigabytes of the same pattern" ones, you could choose to compress it. (The client could even use some cunning method to generate the compressed request more efficiently, with knowledge of the deflate file format.)

If the NFS server wants to pass that down to a lower level (such as the SCSI WRITE SAME command) then a general-purpose compression scheme would not make that particularly easy. But if it really is just about reducing the wire traffic...
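For a sense of the possible savings, deflate collapses long runs of identical bytes almost completely; a quick zlib demonstration (build with -lz):

    /* 1MiB of zeroes deflates to roughly a kilobyte, so a
     * compressing transport would capture most of the wire
     * savings for this kind of workload. */
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
            static unsigned char in[1 << 20];    /* 1MiB, zero-initialized */
            static unsigned char out[1 << 16];
            uLongf outlen = sizeof(out);

            if (compress(out, &outlen, in, sizeof(in)) != Z_OK) {
                    fprintf(stderr, "compress failed\n");
                    return 1;
            }
            printf("%zu bytes -> %lu bytes\n", sizeof(in), (unsigned long)outlen);
            return 0;
    }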

Reducing data over the wire

Posted Jun 16, 2025 20:34 UTC (Mon) by jreiser (subscriber, #11027) [Link] (12 responses)

I strongly agree that deflate (gzip) should be supported. Zeroes-only is lame, and also does not meet the requirements for NIST 800-88 Standard For Drive Erasure. Seven passes are required: zeroes, ones, bitwise alternating zeroes and ones, bitwise alternating ones and zeroes, zeroes, ones, zeroes. Zeroes-only misses 4 of those passes.

Reducing data over the wire

Posted Jun 16, 2025 20:37 UTC (Mon) by mb (subscriber, #50428) [Link] (4 responses)

Why would anybody want to run *drive* erasure patterns on a network file system?

Reducing data over the wire

Posted Jun 17, 2025 15:03 UTC (Tue) by dsfch (subscriber, #176007) [Link] (3 responses)

People "layer" things.
If your use case is, for example, building virtual-machine "disk" images on NFS, you might like the idea of fast-clearing (to "whatever standard pattern", not just to zeroes). And one could then well say it would be nice if such an ioctl, done on a loop device that maps to a file on NFS, would "pass through" to an equally fast operation on NFS.
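For the block-device half of that picture, the kernel already has a zeroing ioctl that such a passthrough could build on; whether a loop device backed by a file on NFS could translate it into WRITE_SAME is the open question. A sketch using BLKZEROOUT:

    /* Zero a range of a block device with BLKZEROOUT; start and
     * length must be multiples of the logical block size.  On a
     * loop device over an NFS-backed file, pushing this down to
     * WRITE_SAME is the passthrough imagined above. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
                    return 1;
            }
            int fd = open(argv[1], O_WRONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            uint64_t range[2] = { 0, 1ULL << 20 };  /* first 1MiB */
            if (ioctl(fd, BLKZEROOUT, range) < 0) {
                    perror("BLKZEROOUT");
                    return 1;
            }
            return 0;
    }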

Reducing data over the wire

Posted Jun 17, 2025 15:55 UTC (Tue) by mb (subscriber, #50428) [Link] (2 responses)

That doesn't even work on normal disk filesystems.
If you overwrite a file with patterns on COW filesystems, it won't actually overwrite the data on disk.
If pattern overwriting makes sense, then only on raw devices. And even for raw devices there's no guarantee it will actually overwrite everything with the pattern. On SSD it's pretty much guaranteed it won't do it in the order you requested due to wear leveling.

Reducing data over the wire

Posted Jun 17, 2025 17:43 UTC (Tue) by adobriyan (subscriber, #30858) [Link] (1 responses)

> On SSD it's pretty much guaranteed it won't do it in the order you requested due to wear leveling.

Wear leveling can be defeated. Also, pattern is irrelevant and better not be used -- SSDs scramble data before writing to NAND.

I'd claim that, in practice, two full capacities of random(!) writes of host-generated random data, one LBA at a time, is the best regular folks can do.

After the first full-capacity run, the SSD has no choice but to keep the data. The second run most certainly overwrites the overcapacity.
No sane manufacturer ships 100% overcapacity, so two runs should be fine.

Random writes ensure maximum GC pressure.

Of course, there are many assumptions like SSD is not chopped into namespaces and you see only one of many.
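Taken literally, that recipe might look like the sketch below (destructive, raw devices only; the 4KiB block size is an assumption, and randomly placed writes don't guarantee every LBA is hit; the point is total volume and GC pressure):

    /* Two full capacities of randomly placed, random-content
     * writes, one (assumed 4KiB) logical block at a time. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define LBA_SIZE 4096ULL    /* assumed logical block size */

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
                    return 1;
            }
            int fd = open(argv[1], O_WRONLY | O_DIRECT);
            int rnd = open("/dev/urandom", O_RDONLY);
            uint64_t size;
            void *buf;

            if (fd < 0 || rnd < 0 || ioctl(fd, BLKGETSIZE64, &size) < 0 ||
                posix_memalign(&buf, LBA_SIZE, LBA_SIZE)) {
                    perror("setup");
                    return 1;
            }
            uint64_t nblocks = size / LBA_SIZE;
            for (uint64_t i = 0; i < 2 * nblocks; i++) {
                    uint64_t lba;
                    if (read(rnd, &lba, sizeof(lba)) != sizeof(lba) ||
                        read(rnd, buf, LBA_SIZE) != (ssize_t)LBA_SIZE)
                            return 1;
                    if (pwrite(fd, buf, LBA_SIZE,
                               (lba % nblocks) * LBA_SIZE) < 0) {
                            perror("pwrite");
                            return 1;
                    }
            }
            fsync(fd);
            return 0;
    }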

Reducing data over the wire

Posted Jun 18, 2025 9:17 UTC (Wed) by farnz (subscriber, #17727) [Link]

The trouble with overwriting as a way to destroy the data is that you're dependent on the SSD firmware not being malicious; if it's programmed to detect "interesting" data and ensure it's kept across GC runs, then no amount of rewriting the SSD will help. At best, you'll trigger it into entering a "failure" state as it runs out of usable NAND blocks, only for it to go back into read-only mode when the attacker tells it to.

Additionally, there are advanced attacks on flash (if you have to care about state-level actors) which depend on the fact that the NAND cell itself isn't perfectly quantized; the exact level set when you write is a function both of the write request and of previous values, and a sufficiently determined attacker can reconstruct old values to a degree that lets the ECC kick in and restore the data. This is a seriously advanced threat model that most of us don't even have to consider; if it applies to you, your contacts at your national security agencies will be able to help you further (and if you don't have routine contact with your national security agency, this is not a threat that applies to you).

Thus, you've got four good ways (depending on threat model) to handle preventing a discarded SSD from having its data extracted (from least secure to most):

  1. Use WRITE_SAME or other discard mechanisms to remove the data from normal visibility. This protects you against accidental data recovery, but does not protect you from an attacker willing to disassemble the SSD and extract the raw flash data, nor from rollback bugs in the SSD; it is probably good enough for most people.
  2. Overwrite the entire device as you've described. This protects you from non-malicious SSD firmware, but runs the risk of malicious SSD firmware saving the data for an attacker willing to disassemble the SSD and extract the raw flash data.
  3. Follow paulj's suggestion and encrypt the contents with a key stored off-SSD (e.g. a long passphrase, or key in a TPM2), and arrange to destroy the key beyond recovery when you want to lose the data. As long as you choose a cryptosystem that's strong enough, this protects you against any plausible attacker - they can recover the ciphertext, but without the key, they can't get the plaintext.
  4. Physically destroy the SSD - run it through an industrial shredder, so that it's just an "ore" to extract raw materials from, and not complete devices any more. If you can't get to the individual devices, it's impossible to read them out.

Given how simple it is to encrypt storage on Linux, with the key stored elsewhere (or generated from a passphrase via a strong key derivation process like argon2id), I'd strongly suggest that option 3 is the right answer for almost everyone, except those whose national security agencies recommend option 4 to them.

Reducing data over the wire

Posted Jun 16, 2025 20:40 UTC (Mon) by butlerm (subscriber, #13312) [Link] (6 responses)

They should perhaps make a SHRED or OBLITERATE command for that.

Reducing data over the wire

Posted Jun 16, 2025 22:03 UTC (Mon) by ejr (subscriber, #51652) [Link] (1 responses)

That depends on the threat model. A single, well-known command can be intercepted.

But, yeah, I feel that anything that isn't end-to-end is lost at that point anyway.

Reducing data over the wire

Posted Jun 17, 2025 8:20 UTC (Tue) by farnz (subscriber, #17727) [Link]

It's not that much more of a challenge to intercept a set of WRITE SAME commands as compared to intercepting a SHRED command; if command interception is a concern, you need to handle that at a different layer (e.g. TLS).

Reducing data over the wire

Posted Jun 17, 2025 9:41 UTC (Tue) by paulj (subscriber, #341) [Link]

The best way to securely erase a drive is to have the data on it encrypted from day 1. Then you only need to securely 'lose' 128 to 256 bits of the key.

Reducing data over the wire

Posted Jun 17, 2025 23:39 UTC (Tue) by smooth1x (guest, #25322) [Link] (2 responses)

Why could we not use fstrim for that already?

Reducing data over the wire

Posted Jun 18, 2025 6:23 UTC (Wed) by Wol (subscriber, #4433) [Link]

As I understand it, fstrim merely tells the layer below "this is no longer wanted". It doesn't wipe it, it just says "feel free to wipe it if you want to".

The idea being, if the disk has been completely written to, but is only say 1/3 full with real data, the disk will grab a block that's been 100% trimmed and just do a "wipe and write". If it hadn't been trimmed, it wouldn't know that the "data" is garbage, and would have had to do a "salvage what I'm not overwriting, wipe, and rewrite". Which drives wear through the roof and rewrite speeds through the floor.
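Concretely, fstrim(8) boils down to a single ioctl() that hands the filesystem a range; the filesystem then tells the device which blocks within it are unused, and the device is free to act on that or not:

    /* The ioctl behind fstrim(8).  As described above, it is a
     * hint that the blocks are garbage, not a wipe. */
    #include <fcntl.h>
    #include <limits.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                    return 1;
            }
            int fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            struct fstrim_range range;
            memset(&range, 0, sizeof(range));
            range.len = ULLONG_MAX;        /* the whole filesystem */
            if (ioctl(fd, FITRIM, &range) < 0) {
                    perror("FITRIM");
                    return 1;
            }
            /* The kernel updates range.len to the bytes trimmed. */
            printf("%llu bytes trimmed\n", (unsigned long long)range.len);
            return 0;
    }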

Cheers,
Wol

Reducing data over the wire

Posted Jun 18, 2025 9:20 UTC (Wed) by farnz (subscriber, #17727) [Link]

WRITE_SAME is the NFS command used to implement fstrim. The problem with it is that it doesn't oblige the underlying device to destroy the data, just to read back as the given pattern (usually all-zeroes); on SSDs, this is usually done by leaving the data alone, and updating the mapping table to say that the given LBAs should now read back as the supplied pattern.

As a result, fstrim doesn't do what a hypothetical SHRED command would; that said, the best way to implement such a command would be to encrypt the data with a secure cryptosystem, and simply lose the key when it's time to destroy the data, which (for NFS) you can do with an overlay filesystem like ecryptfs or similar (noting that many of these have bugs that you'd have to fix first).

SCSI command

Posted Jun 17, 2025 9:59 UTC (Tue) by claudex (subscriber, #92510) [Link] (1 responses)

Is the SCSI WRITE SAME command supported by some hardware, or is it only implemented on virtual disks and iSCSI targets?

SCSI command

Posted Jun 17, 2025 13:26 UTC (Tue) by farnz (subscriber, #17727) [Link]

There are SAS SSDs, like the Samsung PM1643 and PM1653 families, which support WRITE SAME.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds