Supporting NFS v4.2 WRITE_SAME
At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Anna Schumaker led a discussion about implementing the NFS v4.2 WRITE_SAME command in both the NFS client and server. WRITE_SAME is meant to write large amounts of identical data (e.g. zeroes) to the server without actually needing to transfer all of it over the wire. In her topic proposal, Schumaker wondered whether other filesystems needed the functionality, in which case it should be implemented at the virtual filesystem (VFS) layer, or whether it should simply be handled as an NFS-specific ioctl().
The NFS WRITE_SAME operation was partly inspired by the SCSI WRITE SAME command, she began; it is "intended for databases to be able to initialize a bulk of records all at once". It offloads much of the work to the server side. So far, Schumaker has been implementing WRITE_SAME with an ioctl() using a structure that looks similar to the application data block structure defined in the NFS v4.2 RFC for use by WRITE_SAME.
On the server side, it would make sense to have a function that gets called to process the WRITE_SAME command, but it would be nice if that same function were available to clients; they could use it as a fallback when the server does not implement WRITE_SAME. Other filesystems could potentially also use the functionality, either with the SCSI WRITE SAME command or for some other filesystem-specific use case.
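As a rough illustration of what such a fallback has to do, here is a minimal user-space sketch in C. It is not kernel code and not Schumaker's implementation; the function name `write_same_fallback()` and its parameters are hypothetical, chosen only to show the idea of expanding a small pattern into blocks locally and writing them with ordinary `pwrite()` calls when no offload is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for pwrite() */
#endif
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not an existing API):
 * fill block_count blocks of block_size bytes, starting at offset, where
 * each block holds the pattern repeated from the start of the block.
 * Returns 0 on success, or -1 with errno set.
 */
int write_same_fallback(int fd, off_t offset, size_t block_size,
                        size_t block_count, const void *pattern,
                        size_t pattern_len)
{
    if (block_size == 0 || pattern_len == 0 || pattern_len > block_size) {
        errno = EINVAL;
        return -1;
    }

    unsigned char *block = malloc(block_size);
    if (!block)
        return -1;

    /* Build one block; the last copy of the pattern may be cut short. */
    for (size_t pos = 0; pos < block_size; pos += pattern_len) {
        size_t left = block_size - pos;
        memcpy(block + pos, pattern, left < pattern_len ? left : pattern_len);
    }

    /* Write the same block repeatedly, coping with short writes. */
    for (size_t i = 0; i < block_count; i++) {
        size_t done = 0;

        while (done < block_size) {
            ssize_t ret = pwrite(fd, block + done, block_size - done,
                                 offset + (off_t)(i * block_size + done));
            if (ret < 0 && errno == EINTR)
                continue;
            if (ret <= 0) {
                if (ret == 0)
                    errno = EIO;
                free(block);
                return -1;
            }
            done += (size_t)ret;
        }
    }

    free(block);
    return 0;
}
```

A caller initializing 1024 records of 512 bytes with a four-byte marker would use something like `write_same_fallback(fd, 0, 512, 1024, "REC0", 4)`. Unlike the offloaded operation, every byte still crosses the wire here, which is exactly the cost WRITE_SAME is meant to avoid.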
The application data block allows for WRITE_SAME commands that write various patterns to the storage, but Christoph Hellwig suggested that all of that complexity should be avoided. He was responsible for writing the WRITE_SAME definition for NFS and for killing off the Linux block-layer support for the SCSI WRITE SAME patterns; "don't do it", he said with a laugh. WRITE_SAME for zeroing is "perfectly fine", and SCSI supports that, but "exposing all the detailed, crazy patterns" is "not sane". Getting the semantics right for all of the different cases is extremely difficult. Schumaker said that sounded reasonable.
There is already an API available for clients to use, Amir Goldstein said: fallocate() with the FALLOC_FL_ZERO_RANGE flag. Schumaker said that NFS did not have support for that flag, but Goldstein said that support could be added as the way to provide access to WRITE_SAME. Hellwig said that there was a patch set that he had not yet looked at closely to add an FALLOC_FL_WRITE_ZEROES flag that would force the zeroes to be written; it might be a better API for WRITE_SAME. That series is now on v5 and seems to be progressing toward inclusion.
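For reference, this is roughly how an application uses the existing interface today. The sketch below is an assumption-laden example rather than anything shown in the session: the wrapper name `zero_range()` is hypothetical, and the fallback of writing zeroes by hand when the filesystem returns `EOPNOTSUPP` is one plausible choice an application might make. It uses only `fallocate()` with `FALLOC_FL_ZERO_RANGE`; the proposed `FALLOC_FL_WRITE_ZEROES` flag is left out because installed headers may not define it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>         /* FALLOC_FL_ZERO_RANGE */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (an assumption for illustration): zero len bytes
 * starting at offset. Ask the filesystem to do it first; if the filesystem
 * does not support FALLOC_FL_ZERO_RANGE, write zeroes explicitly.
 * Returns 0 on success, or -1 with errno set.
 */
int zero_range(int fd, off_t offset, off_t len)
{
    static const char zeroes[65536];   /* static storage is zero-filled */

    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) == 0)
        return 0;
    if (errno != EOPNOTSUPP)
        return -1;

    /* Fallback path: push the zeroes through ordinary writes. */
    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(zeroes) ? (size_t)len
                                                   : sizeof(zeroes);
        ssize_t ret = pwrite(fd, zeroes, chunk, offset);

        if (ret < 0 && errno == EINTR)
            continue;
        if (ret <= 0) {
            if (ret == 0)
                errno = EIO;
            return -1;
        }
        offset += ret;
        len -= ret;
    }
    return 0;
}
```

With a wrapper like this, an NFS client that mapped the flag onto WRITE_SAME would speed up existing callers without any change to application code, which is the appeal of reusing fallocate() instead of adding a new ioctl().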
Matthew Wilcox wondered whether only being able to write zeroes would make the WRITE_SAME feature less than entirely useful; he remembered "a certain amount of pushback because databases need a specific pattern". There was a fair amount of joking about which of the two Oracle databases (the other being MySQL) he meant; Wilcox works for Oracle, as does Schumaker, who seemed to indicate that she had talked to the MySQL group. In the end, someone seemed to sum up that only supporting zeroing is reasonable: "zeroes are good".
Chuck Lever, who also works for Oracle, said that he had spoken to the Oracle database group. That database does not use the Linux NFS client, so the group did not care about support for WRITE_SAME in the client. The group's concern was mostly about support for WRITE_SAME in proprietary NFS servers, he said. Wilcox asked: "and Linux NFS servers?" Lever said that Oracle databases do not deploy on systems that use those.
