Supporting NFS v4.2 WRITE_SAME
At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Anna Schumaker led a discussion about implementing the NFS v4.2 WRITE_SAME command in both the NFS client and server. WRITE_SAME is meant to write large amounts of identical data (e.g. zeroes) to the server without actually needing to transfer all of it over the wire. In her topic proposal, Schumaker wondered whether other filesystems need the functionality, in which case it should be implemented at the virtual filesystem (VFS) layer, or whether it should simply be handled as an NFS-specific ioctl().
The NFS WRITE_SAME operation was partly inspired by the SCSI WRITE SAME command, she began; it is "intended for databases to be able to initialize a bulk of records all at once". It offloads much of the work to the server side. So far, Schumaker has been implementing WRITE_SAME with an ioctl() using a structure that looks similar to the application data block structure defined in the NFS v4.2 RFC for use by WRITE_SAME.
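The application data block in RFC 7862 describes the data to be written in terms of a block size, a block count, and offsets for an embedded block number and pattern. As a rough illustration only, an ioctl() argument mirroring those fields might look something like the sketch below; the structure name, field layout, and ioctl number here are hypothetical, not taken from Schumaker's actual patches.

```c
#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical ioctl argument, loosely following the app_data_block4
 * structure defined in RFC 7862 for WRITE_SAME. */
struct nfs_write_same_args {
	__u64	adb_offset;		/* byte offset of the first block */
	__u64	adb_block_size;		/* size of each block, in bytes */
	__u64	adb_block_count;	/* number of blocks to write */
	__u64	adb_reloff_blocknum;	/* offset of the block number within a block */
	__u64	adb_block_num;		/* block number stored in the first block */
	__u64	adb_reloff_pattern;	/* offset of the pattern within a block */
	__u64	adb_pattern_len;	/* length of the pattern data */
	__u64	adb_pattern;		/* user pointer to the pattern data,
					   carried as a __u64 for 32/64-bit compat */
};

/* The magic number and command number are invented for this sketch. */
#define NFS_IOC_WRITE_SAME	_IOW('N', 0x42, struct nfs_write_same_args)
```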
![Anna Schumaker](https://static.lwn.net/images/2025/lsfmb-schumaker-sm.png)
On the server side, it would make sense to have a function that gets called to process the WRITE_SAME command, but it would be nice if that same function were available to clients; they could use it as a fallback when the server does not implement WRITE_SAME. Other filesystems could potentially also use the functionality, either by way of the SCSI WRITE SAME command or for some other filesystem-specific use case.
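As a sketch of what such a shared fallback could look like (the function name and parameters here are invented for illustration, not taken from the patches), a client that finds the server lacking WRITE_SAME support could expand the pattern into a block locally and issue ordinary writes:

```c
#include <unistd.h>

/* Minimal fallback: replicate the pattern into one block, then write
 * that block block_count times at successive offsets. A real
 * implementation would handle larger block sizes and short writes. */
static ssize_t write_same_fallback(int fd, off_t offset, size_t block_size,
				   size_t block_count,
				   const void *pattern, size_t pattern_len)
{
	char block[4096];
	size_t i;

	if (block_size == 0 || pattern_len == 0 ||
	    block_size > sizeof(block) || pattern_len > block_size)
		return -1;

	/* Fill one block by repeating the pattern. */
	for (i = 0; i < block_size; i++)
		block[i] = ((const char *)pattern)[i % pattern_len];

	for (i = 0; i < block_count; i++) {
		ssize_t ret = pwrite(fd, block, block_size,
				     offset + (off_t)(i * block_size));
		if (ret < 0)
			return ret;
	}
	return 0;
}
```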
The application data block allows for WRITE_SAME commands that write various patterns to the storage, but Christoph Hellwig suggested that all of that complexity should be avoided. He was responsible for writing the WRITE_SAME definition for NFS and for killing off the Linux block-layer support for the SCSI WRITE SAME patterns; "don't do it", he said with a laugh. WRITE_SAME for zeroing is "perfectly fine", SCSI supports that, but "exposing all the detailed, crazy patterns" is "not sane". Getting the semantics right for all of the different cases is extremely difficult. Schumaker said that sounded reasonable.
There is already an API available for clients to use, Amir Goldstein said: fallocate() with the FALLOC_FL_ZERO_RANGE flag. Schumaker said that NFS did not have support for that flag, but Goldstein said that support could be added as the way to provide access to WRITE_SAME. Hellwig said that there was a patch set that he had not yet looked at closely to add an FALLOC_FL_WRITE_ZEROES flag that would force the zeroes to be written; it might be a better API for WRITE_SAME. That series is now on v5 and seems to be progressing toward inclusion.
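For reference, the existing interface that Goldstein pointed to looks like this from user space; this is ordinary fallocate() usage rather than anything NFS-specific, and the file handling is just for illustration:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return EXIT_FAILURE;
	}
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	/* Zero the first 1GB of the file without transferring any data;
	 * this fails with EOPNOTSUPP on filesystems that lack support,
	 * which today includes NFS. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, (off_t)1 << 30) < 0)
		perror("fallocate");
	close(fd);
	return 0;
}
```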
Matthew Wilcox wondered whether only being able to write zeroes would make the WRITE_SAME feature less than entirely useful; he remembered "a certain amount of pushback because databases need a specific pattern". There was a fair amount of joking about which of the two Oracle databases (the other being MySQL) he meant; Wilcox works for Oracle, as does Schumaker, who seemed to indicate that she had talked to the MySQL group. In the end, someone seemed to sum up that only supporting zeroing is reasonable: "zeroes are good".
Chuck Lever, who also works for Oracle, said that he had spoken to the Oracle database group. That database does not use the Linux NFS client, so the group did not care about support for WRITE_SAME in the client. The group's concern was mostly about support for WRITE_SAME in proprietary NFS servers, he said. Wilcox asked: "and Linux NFS servers?" Lever said that Oracle databases do not deploy on systems that use those.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/NFS |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2025 |
Reducing data over the wire

Posted Jun 16, 2025 15:50 UTC (Mon) by epa (subscriber, #39769) (13 responses)

> WRITE_SAME is meant to write large amounts of identical data (e.g. zeroes) to the server without actually needing to transfer all of it over the wire.

If that's really the concern, then shouldn't NFS support optional compression with deflate (gzip) or some equally boring compression method? Not all requests would have to be compressed, but if you have one of these "write gigabytes of the same pattern" ones you could choose to compress it. (The client could even use some cunning method to generate the compressed request more efficiently, with knowledge of the deflate file format.) If the NFS server wants to pass that down to a lower level (such as the SCSI WRITE_SAME) then a general-purpose compression scheme would not make that particularly easy. But if it really is just about reducing the wire traffic...

Posted Jun 16, 2025 20:34 UTC (Mon) by jreiser (subscriber, #11027) (12 responses)

Posted Jun 16, 2025 20:37 UTC (Mon) by mb (subscriber, #50428) (4 responses)

Posted Jun 17, 2025 15:03 UTC (Tue) by dsfch (subscriber, #176007) (3 responses)

If your use case is to, e.g., build virtual-machine "disk" images on NFS, you might like the idea of fast-clearing (to "whatever standard pattern", not just to zeroes). And one could then well say it would be nice if an ioctl to do so, done on the loopback device that maps to a file on NFS, would "pass through" to a fast operation on NFS.

Posted Jun 17, 2025 15:55 UTC (Tue) by mb (subscriber, #50428) (2 responses)

If you overwrite a file with patterns on COW filesystems, it won't actually overwrite the data on disk. If pattern overwriting makes sense, then only on raw devices. And even for raw devices there's no guarantee it will actually overwrite everything with the pattern; on an SSD it's pretty much guaranteed it won't do it in the order you requested, due to wear leveling.

Posted Jun 17, 2025 17:43 UTC (Tue) by adobriyan (subscriber, #30858) (1 response)

Wear leveling can be defeated. Also, the pattern is irrelevant and better not used: SSDs scramble data before writing to NAND. I'd claim that, in practice, two full-capacity runs of random(!) writes of host-generated random data, one LBA at a time, are the best regular folks could do. After the first full-capacity run, the SSD has no choice but to keep the data; the second run most certainly overwrites the overcapacity, and no sane manufacturer does 100% overcapacity, so two runs should be fine. Random writes ensure maximum GC pressure. Of course, there are many assumptions here, like the SSD not being chopped into namespaces of which you see only one.

Posted Jun 18, 2025 9:17 UTC (Wed) by farnz (subscriber, #17727)

The trouble with overwriting as a way to destroy the data is that you're dependent on the SSD firmware not being malicious; if it's programmed to detect "interesting" data and ensure it's kept across GC runs, then no amount of rewriting the SSD will help - at best, you'll trigger it entering "failure" state as it runs out of usable NAND blocks, and then being able to go back into read-only mode when the attacker tells it to.

Additionally, there are advanced attacks on flash (if you have to care about state-level actors) which depend on the fact that the NAND cell itself isn't perfectly quantized; the exact level set when you write is a function both of the write request and of previous values, and a sufficiently determined attacker can reconstruct old values to a degree that lets the ECC kick in and restore the data. This is a seriously advanced threat model that most of us don't even have to consider; if it applies to you, your contacts at your national security agencies will be able to help you further (and if you don't have routine contact with your national security agency, this is not a threat that applies to you).

Thus, you've got four good ways (depending on threat model, from least secure to most) to prevent a discarded SSD from having its data extracted. Given how simple it is to encrypt storage on Linux, with the key stored elsewhere (or generated from a passphrase via a strong key-derivation process like argon2id), I'd strongly suggest that option 3 is the right answer for almost everyone, except those whose national security agencies recommend option 4 to them.

Posted Jun 16, 2025 20:40 UTC (Mon) by butlerm (subscriber, #13312) (6 responses)

Posted Jun 16, 2025 22:03 UTC (Mon) by ejr (subscriber, #51652) (1 response)

But, yeah, I feel that all non-end-to-end is lost at that point anyways.

Posted Jun 17, 2025 8:20 UTC (Tue) by farnz (subscriber, #17727)

It's not that much more of a challenge to intercept a set of WRITE SAME commands as compared to intercepting a SHRED command; if command interception is a concern, you need to handle that at a different layer (e.g. TLS).

Posted Jun 17, 2025 9:41 UTC (Tue) by paulj (subscriber, #341)

Posted Jun 17, 2025 23:39 UTC (Tue) by smooth1x (guest, #25322) (2 responses)

Why could we not use fstrim for that already?

Posted Jun 18, 2025 6:23 UTC (Wed) by Wol (subscriber, #4433)

The idea being, if the disk has been completely written to, but is only say 1/3 full with real data, the disk will grab a block that's been 100% trimmed and just do a "wipe and write". If it hadn't been trimmed, it wouldn't know that the "data" is garbage, and would have had to do a "salvage what I'm not overwriting, wipe, and rewrite". Which drives wear through the roof and rewrite times through the floor.

Cheers,
Wol

Posted Jun 18, 2025 9:20 UTC (Wed) by farnz (subscriber, #17727)

WRITE_SAME is the NFS command used to implement fstrim. The problem with it is that it doesn't oblige the underlying device to destroy the data, just to read back as the given pattern (usually all zeroes); on SSDs, this is usually done by leaving the data alone and updating the mapping table to say that the given LBAs should now read back as the supplied pattern. As a result, fstrim doesn't do what a hypothetical SHRED command would; that said, the best way to implement such a command would be to encrypt the data with a secure cryptosystem, and simply lose the key when it's time to destroy the data, which (for NFS) you can do with an overlay filesystem like ecryptfs or similar (noting that many of these have bugs that you'd have to fix first).

SCSI command

Posted Jun 17, 2025 9:59 UTC (Tue) by claudex (subscriber, #92510) (1 response)

Posted Jun 17, 2025 13:26 UTC (Tue) by farnz (subscriber, #17727)

There are SAS SSDs, like the Samsung PM1643 and PM1653 families, which support WRITE SAME.