How useful should copy_file_range() be?
The definition of copy_file_range() is:
ssize_t copy_file_range(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);
Its job is to copy len bytes of data from the file represented by fd_in to fd_out, observing the requested offsets at both ends. The flags argument must be zero. This call first appeared in the 4.5 release. Over time it turned out to have a number of unpleasant bugs, leading to a long series of fixes and some significant grumbling along the way.
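To make the interface concrete, here is a minimal sketch of a caller that copies a byte range between two open regular files. The helper name copy_slice() is invented for this example, and the code assumes a C library that provides the copy_file_range() wrapper; it is an illustration of the calling convention, not a recommended implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len bytes from fd_in (starting at in_off) to fd_out (starting
 * at out_off). copy_slice() is a hypothetical helper, not a libc call.
 * Returns the number of bytes copied, or -1 with errno set if nothing
 * could be copied at all.
 */
ssize_t copy_slice(int fd_in, loff_t in_off,
                   int fd_out, loff_t out_off, size_t len)
{
    size_t done = 0;

    assert(fd_in >= 0 && fd_out >= 0);
    while (done < len) {
        /* The kernel advances in_off and out_off by the amount copied. */
        ssize_t n = copy_file_range(fd_in, &in_off, fd_out, &out_off,
                                    len - done, 0 /* flags must be zero */);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            /* Report partial progress if there was any. */
            return done ? (ssize_t)done : -1;
        }
        if (n == 0)
            break;  /* end of input, or nothing the kernel will copy */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Because explicit offsets are passed by pointer, the file positions of the two descriptors are left alone; passing NULL for either offset would instead use and advance that descriptor's own position.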
In 2019 Amir Goldstein fixed more issues and, in the process, removed a significant limitation: until then, copy_file_range() refused to copy between files that were not located on the same filesystem. After this patch was merged (for 5.3), it could copy between any two files, falling back on splice() for the cross-filesystem case. It appeared that copy_file_range() was finally settling into a solid and useful system call.
Indeed, it seemed useful enough that the Go developers decided to use it for the io.Copy() function in their standard library. Then they ran into a problem: copy_file_range() will, when given a kernel-generated file as input, copy zero bytes of data and claim success. These files, which include files in /proc, tracefs, and a large range of other virtual filesystems, generally indicate a length of zero when queried with a system call like stat(). copy_file_range(), seeing that zero length, concludes that there is no data to copy and the job is already done; it then returns success.
But there is actually data to be read from this kind of file; it just doesn't show in the advertised length of the file, and the real length often cannot be known before the file is actually read. Before 5.3, the prohibition on cross-filesystem copies would have caused most such attempts to return an error code; afterward, they fail but appear to work. The kernel is happy, but some users can be surprisingly stubborn about actually wanting to copy the data they asked to be copied; they were rather less happy.
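The mismatch between advertised and actual length is easy to observe from user space. The following sketch compares the size reported by stat() with the number of bytes that read() actually delivers; the function names advertised_size() and readable_size() are invented for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size that stat() advertises for path, or -1 on error. */
long long advertised_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}

/* Bytes actually obtained by reading path to end of file, or -1 on error. */
long long readable_size(const char *path)
{
    char buf[4096];
    long long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

For an ordinary file the two functions agree. For a kernel-generated file such as /proc/self/status, one would expect advertised_size() to return zero while readable_size() returns a positive count, which is exactly the situation that trips up a size-based shortcut.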
Marking virtual filesystems
Nicolas Boichat tried to mollify those users with this patch set to copy_file_range(). It added a flag (FS_GENERATED_CONTENT) to the file_system_type structure for virtual filesystems where the length of the files cannot be known in advance. copy_file_range() would then look for that flag and return an error code when it was found; the error return would cause io.Copy() to fall back to a manual copy operation. This change appeared to solve the immediate problem, but it is not destined to be merged into the mainline.
There were a few objections, starting with the fact that it requires all virtual filesystems to be specially marked. The patch set did not mark them all, and this mechanism would require that developers be sure to mark all future filesystems as they were added. As Greg Kroah-Hartman put it: "That way lies madness and constant auditing that I do not see anyone signing up for for the next 20 years".
The bigger question, though, was whether this behavior should be seen as a bug at all. Boichat described it as a regression; code that would fall back to a normal copy before 5.3 would silently fail to copy data thereafter. Kroah-Hartman was unsure, though; he continued:
Dave Chinner was, if anything, less sympathetic:
The use of it as a general file copy mechanism in the Go system library is incorrect and wrong. It is a userspace bug. Userspace has done the wrong thing, userspace needs to be fixed.
The problem with this attitude, as described by Go developer Ian Lance Taylor, is that figuring out when copy_file_range() can be used is not easy; he pointed out that these limitations are not mentioned in the copy_file_range() man page, and argued that this behavior reduces the utility of the system call considerably:
Chinner said that the test is whether it is possible to tell whether a file has data in it without calling read() on it. But Darrick Wong, hardly a filesystem amateur, replied: "I don't know how to do that, Dave. :)" There is another fun twist, as Boichat pointed out: files in sysfs, rather than indicating a zero length, claim to be 4,096 bytes long, regardless of their true length, which may be larger or smaller than that. Chinner's test will fail on those files, even if it can be reliably carried out.
Toward a real fix
Wong went on to express agreement with the Go developers: copy_file_range() should either work as expected or return an error so that user space can know to fall back to copying the old-fashioned way. He also suggested a couple of ways to possibly fix the problem, the first of which was to go back to the previous state of affairs, where cross-filesystem copies were explicitly disallowed. Failing that, one could restrict such copies to a single filesystem type that has explicit support for them. Luis Henriques implemented a variant of that idea, where copies across filesystems would still be allowed if the two filesystems were of the same type, and if the filesystem involved explicitly implements the copy_file_range() operation.
That patch was stopped in its tracks, though, after Trond Myklebust pointed out that the kernel's NFS daemon uses the copy_file_range() mechanism to copy files between filesystems of different types. Blocking that would break some important functionality; this usage pattern exists in other filesystems, such as Ceph and FUSE, as well. In response to that, Henriques added a new flag (COPY_FILE_SPLICE) that could be used within the kernel to indicate that a cross-filesystem-type copy should be performed. There was some question of whether this flag should be made available to user space for cases when it somehow knows that the operation would succeed, but it seems that will not happen.
A final version of this patch has not been posted as of this writing, but the eventual shape of the fix seems clear. When called from user space, copy_file_range() will only try to copy a file across filesystems if the two are of the same type, and if that filesystem has explicit support for the system call (and, thus, is presumably written with all of the possible cases in mind). Otherwise, the call will fail with an explicit error, so user space will know that it must copy the data some other way. So copy_file_range() will never be a generic file-copy mechanism, but it will at least be possible to use robustly in code that is prepared for it to fail.
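What "code that is prepared for it to fail" might look like can be sketched as follows. The helper names copy_fd() and copy_by_read_write() are invented for this example, and the policy shown (fall back to read() and write() on any error, and also after a zero-byte result) is one plausible choice under the assumptions described in this article, not the approach taken by any particular library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* The old-fashioned way: read() and write() until end of input. */
static ssize_t copy_by_read_write(int fd_in, int fd_out)
{
    char buf[65536];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd_in, buf, sizeof(buf));
        if (n == 0)
            return total;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* write() may accept less than requested; keep going until done. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}

/*
 * Copy everything from the current position of fd_in to fd_out.
 * copy_fd() is a hypothetical helper, not a libc or kernel interface.
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t copy_fd(int fd_in, int fd_out)
{
    ssize_t total = 0;
    ssize_t rest;

    assert(fd_in >= 0 && fd_out >= 0);
    for (;;) {
        /* NULL offsets: use and advance each descriptor's file position. */
        ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, 1 << 20, 0);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        break;  /* an error, or zero bytes copied */
    }
    /*
     * Zero can mean end of file, but also a kernel-generated file whose
     * advertised length is wrong; an error can mean that the kernel refuses
     * this pair of files. Either way, let read() and write() have the last
     * word, and report whatever error they encounter.
     */
    rest = copy_by_read_write(fd_in, fd_out);
    return rest < 0 ? -1 : total + rest;
}
```

The cost of this design is one extra read() at the end of every successful copy; the benefit is that a zero-byte "success" is never taken at face value.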
There is still one more trap lurking within copy_file_range(), though. Like most I/O-related system calls, copy_file_range() can copy fewer bytes than requested; user space needs to check the return value to see how much work was actually done. There is currently no way to distinguish between copies that were cut short on the read side (by hitting the end of the file, perhaps) and those that were stopped on the write side (which may well indicate a write error). Nobody has come up with a real solution to that problem yet.
All of this goes to show how a seemingly simple interface can quickly become complex. copy_file_range() has revealed a number of sharp edges over its relatively short existence; there may well be more yet to be found. It is thus perhaps unsurprising that the kernel developers, having been burned more than once, feel a strong desire to keep its implementation as simple as possible.
Posted Feb 18, 2021 15:44 UTC (Thu)
by dullfire (guest, #111432)
[Link] (2 responses)
However, if that was really desired, I imagine a fairly simple way to do it would be to add a flag, something like CFR_UPDATE_OFF_SIZE. Setting it would make the kernel update the loff_t values (pointed to by off_in and off_out) with the bytes read and written, respectively. Userspace could then easily tell which side failed: if the two sides are equal, it was a read-side failure; if the read side is greater, it was a write failure. Note that it would be a logic error if the write side were greater than the read side.
Posted Feb 18, 2021 15:48 UTC (Thu)
by dullfire (guest, #111432)
[Link] (1 responses)
So either the author was mistaken and it's not actually an issue, or the kernel doesn't update the read-side loff_t when there is a write failure, in which case the flag would just change this behavior.
Posted Feb 18, 2021 16:43 UTC (Thu)
by matthias (subscriber, #94967)
[Link]
Posted Feb 18, 2021 16:36 UTC (Thu)
by zuzzurro (subscriber, #61118)
[Link] (8 responses)
Posted Feb 18, 2021 16:41 UTC (Thu)
by zuzzurro (subscriber, #61118)
[Link] (6 responses)
Posted Feb 18, 2021 18:38 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (5 responses)
Posted Feb 20, 2021 12:16 UTC (Sat)
by jengelh (guest, #33263)
[Link] (4 responses)
I am not getting that with the /proc/self/sched file (uses `seq_lseek`, which operates in terms of records). Is sys_llseek *meant* to be POSIX-compatible?
Posted Feb 20, 2021 14:45 UTC (Sat)
by smurf (subscriber, #17840)
[Link]
So the bug is on the user. If you treat these things like ordinary files and expect all the posicky corner cases to work "correctly", you're SOL. These files will never have POSIX semantics. No, you can't use libc to emulate it. Deal.
Yes there should be a way to ask the kernel whether a file conforms to 100% posix. Well, we don't have that. Deal.
One possible workaround is to check the file size. If it's smaller than pagesize*4 or so then it's probably cheaper to copy its data the old-fashioned way anyway.
Posted Feb 20, 2021 17:01 UTC (Sat)
by markh (subscriber, #33984)
[Link] (2 responses)
Posted Feb 23, 2021 19:02 UTC (Tue)
by jsmith45 (guest, #125263)
[Link] (1 responses)
Posted Feb 25, 2021 11:12 UTC (Thu)
by zuzzurro (subscriber, #61118)
[Link]
Posted Feb 19, 2021 17:33 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
Posted Feb 18, 2021 17:31 UTC (Thu)
by dezgeg (subscriber, #92243)
[Link]
Posted Feb 18, 2021 18:24 UTC (Thu)
by Deewiant (subscriber, #97394)
[Link] (12 responses)
Posted Feb 18, 2021 19:21 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (3 responses)
Posted Feb 18, 2021 22:37 UTC (Thu)
by drinkcat (subscriber, #106553)
[Link] (2 responses)
Posted Feb 18, 2021 23:49 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (1 responses)
Posted Feb 19, 2021 0:17 UTC (Fri)
by drinkcat (subscriber, #106553)
[Link]
Posted Feb 19, 2021 7:14 UTC (Fri)
by fw (subscriber, #26023)
[Link] (7 responses)
If you can just close the descriptors and report an error (because the function opened them locally), these issues do not apply.
Posted Feb 19, 2021 7:32 UTC (Fri)
by Deewiant (subscriber, #97394)
[Link] (6 responses)
Looks like it starts by checking the underlying file type (with a stat() if necessary) and only tries copy_file_range on regular files of nonzero size (lines 122 and 168; and 284 and 469 for the logic leading to the stat() itself), while still falling back to other methods if copy_file_range only copied zero bytes (lines 175 and 563). Overall there seems to be a lot of logic around keeping track of what was actually written vs. what was reported by the syscalls.
Posted Feb 19, 2021 18:55 UTC (Fri)
by fw (subscriber, #26023)
[Link]
On the other hand, it is still deeply nested within the library, and it is not immediately obvious whether the callers of the universal copy routines expect consistent file offsets on errors. For a function that is directly modeled on the system call (which seems to be the case for Go), predictable file offset behavior seems quite important.
Posted Feb 19, 2021 20:29 UTC (Fri)
by the8472 (guest, #144969)
[Link] (4 responses)
You're looking at the linux implementation for io::copy which can copy arbitrary readers to writers but specializes to various syscalls when types wrapping file descriptors of unknown type are passed, hence the complexity. There also is fs::copy which shares some code but has fewer cases to cover and does fewer checks before invoking The relevant parts are here and here
Posted Jul 10, 2021 12:57 UTC (Sat)
by jaykrell (guest, #153190)
[Link] (3 responses)
Not everything is a file!
In fact, most things are not a file.
Posted Jul 10, 2021 13:54 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link]
Posted Jul 10, 2021 21:48 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jul 10, 2021 23:16 UTC (Sat)
by flussence (guest, #85566)
[Link]
You're being facetious, but it's occasionally very useful to be able to do things like check which port a monitor is plugged in on or framedump the console on a server using only ssh.
I guess it's too late to do the same anymore...
If yes, it's a kernel bug.
If no, then it's a libc bug because it failed to provide/emulate POSIX semantics on top of an (unposixy) kernel interface.
So, which is it, where should I file a bug?
The Go implementation works on file descriptors, not files. Full userspace emulation of copy_file_range is very hard: append-only output files, non-seekable input files that cannot restore the correct input read position after an output failure, other error conditions that are not recoverable because system calls fail during rollback. If it's difficult for the kernel, it's probably hard for userspace, too.
I see, there is a different code path that reaches the copy_file_range system call.
copy_file_range since it's only meant to copy entire regular files. fs::copy case, it just tries copy_file_range and then falls back in various cases, including 0 byte reads.
This data should be retrieved through strongly typed special purpose function calls.
Or at least, perhaps, allow open, for a hierarchical namespace, but not read, and reveal all information via ioctl.
Is the screen a file?
Is the keyboard a file?
Are sockets files? They have read/write, but how about seek and mmap?
Windows NT tried that. It failed miserably.
>Is the keyboard a file?