How useful should copy_file_range() be?

By Jonathan Corbet
February 18, 2021

The copy_file_range() system call looks like a relatively straightforward feature; it allows user space to ask the kernel to copy a range of data from one file to another, hopefully applying some optimizations along the way. In truth, this call has never been as generic as it seems, though some changes made during 5.3 helped in that regard. When the developers of the Go language ran into problems with copy_file_range(), there ensued a lengthy discussion on how this system call should work and whether the kernel needs to do more to make it useful.

The definition of copy_file_range() is:

    ssize_t copy_file_range(int fd_in, loff_t *off_in,
                            int fd_out, loff_t *off_out,
                            size_t len, unsigned int flags);

Its job is to copy len bytes of data from the file represented by fd_in to fd_out, observing the requested offsets at both ends. The flags argument must be zero. This call first appeared in the 4.5 release. Over time it turned out to have a number of unpleasant bugs, leading to a long series of fixes and some significant grumbling along the way.
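As a minimal sketch of how the call is typically made from C (assuming glibc 2.27 or later, which provides the wrapper when _GNU_SOURCE is defined), the hypothetical helper below issues one copy_file_range() call with explicit offsets. The name copy_range_once() is invented for illustration and is not part of any library.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue a single copy_file_range() call that copies
 * up to len bytes from *pos_in in fd_in to *pos_out in fd_out.
 *
 * Because explicit offsets are passed, the descriptors' own file positions
 * are not changed; instead the kernel advances *pos_in and *pos_out by the
 * number of bytes it copied.
 *
 * Returns the byte count (which may be short, or zero) or -1 with errno set.
 */
ssize_t copy_range_once(int fd_in, off64_t *pos_in,
                        int fd_out, off64_t *pos_out, size_t len)
{
    /* The flags argument must currently be zero. */
    return copy_file_range(fd_in, pos_in, fd_out, pos_out, len, 0);
}
```

A caller would normally loop on this until len bytes have been copied or the call returns zero or an error; a single call is shown here only to make the argument order and the offset handling concrete.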

In 2019 Amir Goldstein fixed more issues and, in the process, removed a significant limitation: until then, copy_file_range() refused to copy between files that were not located on the same filesystem. After this patch was merged (for 5.3), it could copy between any two files, falling back on splice() for the cross-filesystem case. It appeared that copy_file_range() was finally settling into a solid and useful system call.

Indeed, it seemed useful enough that the Go developers decided to use it for the io.Copy() function in their standard library. Then they ran into a problem: copy_file_range() will, when given a kernel-generated file as input, copy zero bytes of data and claim success. These files, which include files in /proc, tracefs, and a large range of other virtual filesystems, generally indicate a length of zero when queried with a system call like stat(). copy_file_range(), seeing that zero length, concludes that there is no data to copy and the job is already done; it then returns success.

But there is actually data to be read from this kind of file; it just doesn't show in the advertised length of the file, and the real length often cannot be known before the file is actually read. Before 5.3, the prohibition on cross-filesystem copies would have caused most such attempts to return an error code; afterward, they fail but appear to work. The kernel is happy, but some users can be surprisingly stubborn about actually wanting to copy the data they asked to be copied; they were rather less happy.
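The mismatch between the advertised and the actual length is easy to observe from user space. The sketch below is an illustration only; the helper name advertised_vs_actual() is invented here. It compares the size reported by fstat() with the number of bytes that read() really delivers.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store in *advertised the size reported by fstat()
 * for the file at path, and in *actual the number of bytes that read()
 * returns before reaching end of file.
 *
 * Returns 0 on success, -1 on any error.
 */
int advertised_vs_actual(const char *path, off_t *advertised, off_t *actual)
{
    char buf[4096];
    struct stat st;
    off_t total = 0;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /* Read to end of file, counting what is really there. */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    if (n < 0)
        return -1;

    *advertised = st.st_size;
    *actual = total;
    return 0;
}
```

Pointed at a file such as /proc/self/status, this would be expected to report an advertised size of zero alongside a nonzero count of bytes read, which is exactly the combination that misleads copy_file_range().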

Marking virtual filesystems

Nicolas Boichat tried to mollify those users with this patch set to copy_file_range(). It added a flag (FS_GENERATED_CONTENT) to the file_system_type structure for virtual filesystems where the length of the files cannot be known in advance. copy_file_range() would then look for that flag and return an error code when it was found; the error return would cause io.Copy() to fall back to a manual copy operation. This change appeared to solve the immediate problem, but it is not destined to be merged into the mainline.

There were a few objections, starting with the fact that it requires all virtual filesystems to be specially marked. The patch set did not mark them all, and this mechanism would require that developers be sure to mark all future filesystems as they were added. As Greg Kroah-Hartman put it: "That way lies madness and constant auditing that I do not see anyone signing up for for the next 20 years".

The bigger question, though, was whether this behavior should be seen as a bug at all. Boichat described it as a regression; code that would fall back to a normal copy before 5.3 would silently fail to copy data thereafter. Kroah-Hartman was unsure, though; he continued:

Why are people trying to use copy_file_range on simple /proc and /sys files in the first place? They can not seek (well most can not), so that feels like a "oh look, a new syscall, let's use it everywhere!" problem that userspace should not do.

Dave Chinner was, if anything, less sympathetic:

It is a targeted solution for *regular files only* on filesystems that store persistent data and can accelerate the data copy in some way (e.g. clone, server side offload, hardware offload, etc). It is not intended as a copy mechanism for copying data from one random file descriptor to another.

The use of it as a general file copy mechanism in the Go system library is incorrect and wrong. It is a userspace bug. Userspace has done the wrong thing, userspace needs to be fixed.

The problem with this attitude, as described by Go developer Ian Lance Taylor, is that figuring out when copy_file_range() can be used is not easy; he pointed out that these limitations are not mentioned in the copy_file_range() man page, and argued that this behavior reduces the utility of the system call considerably:

From my perspective, as a kernel user rather than a kernel developer, a system call that silently fails for certain files and that provides no way to determine either 1) ahead of time that the system call will fail, or 2) after the call that the system call did fail, is a useless system call. I can never use that system call, because I don't know whether or not it will work.

Chinner said that the test is whether it is possible to tell whether a file has data in it without calling read() on it. But Darrick Wong, hardly a filesystem amateur, replied: "I don't know how to do that, Dave. :)" There is another fun twist, as Boichat pointed out: files in sysfs, rather than indicating a zero length, claim to be 4,096 bytes long — regardless of their true length, which may be larger or smaller than that. Chinner's test will fail on those files, even if it can be reliably carried out.

Toward a real fix

Wong went on to express agreement with the Go developers: copy_file_range() should either work as expected or return an error so that user space can know to fall back to copying the old-fashioned way. He also suggested a couple of ways to possibly fix the problem, the first of which was to go back to the previous state of affairs, where cross-filesystem copies were explicitly disallowed. Failing that, one could restrict such copies to a single filesystem type that has explicit support for them. Luis Henriques implemented a variant of that idea, where copies across filesystems would still be allowed if the two filesystems were of the same type, and if the filesystem involved explicitly implements the copy_file_range() operation.

That patch was stopped in its tracks, though, after Trond Myklebust pointed out that the kernel's NFS daemon uses the copy_file_range() mechanism to copy files between filesystems of different types. Blocking that would break some important functionality; this usage pattern exists in other filesystems, such as Ceph and FUSE, as well. In response to that, Henriques added a new flag (COPY_FILE_SPLICE) that could be used within the kernel to indicate that a cross-filesystem-type copy should be performed. There was some question of whether this flag should be made available to user space for cases when it somehow knows that the operation would succeed, but it seems that will not happen.

A final version of this patch has not been posted as of this writing, but the eventual shape of the fix seems clear. When called from user space, copy_file_range() will only try to copy a file across filesystems if the two are of the same type, and if that filesystem has explicit support for the system call (and, thus, is presumably written with all of the possible cases in mind). Otherwise, the call will fail with an explicit error, so user space will know that it must copy the data some other way. So copy_file_range() will never be a generic file-copy mechanism, but it will at least be possible to use robustly in code that is prepared for it to fail.

There is still one more trap lurking within copy_file_range(), though. Like most I/O-related system calls, copy_file_range() can copy fewer bytes than requested; user space needs to check the return value to see how much work was actually done. There is currently no way to distinguish between copies that were cut short on the read side (by hitting the end of the file, perhaps) and those that were stopped on the write side (which may well indicate a write error). Nobody has come up with a real solution to that problem yet.

All of this goes to show how a seemingly simple interface can quickly become complex. copy_file_range() has revealed a number of sharp edges over its relatively short existence; there may well be more yet to be found. It is thus perhaps unsurprising that the kernel developers, having been burned more than once, feel a strong desire to keep its implementation as simple as possible.


How useful should copy_file_range() be?

Posted Feb 18, 2021 15:44 UTC (Thu) by dullfire (guest, #111432) [Link] (2 responses)

I'm not sure adding the ability to report whether the short copy was due to the read or the write is really worthwhile.

However, if that were really desired, I imagine a fairly simple way to do it would be to add a flag, something like CFR_UPDATE_OFF_SIZE. Setting it would make the kernel update the loff_t values (pointed to by off_in and off_out) with the bytes read and written, respectively. Userspace could then easily tell which side failed: if the two are equal, it was a read-side failure; if the read side is greater, it was a write failure. Note that it would be a logic error if the write side were greater than the read side.

How useful should copy_file_range() be?

Posted Feb 18, 2021 15:48 UTC (Thu) by dullfire (guest, #111432) [Link] (1 responses)

Never mind, failed to re-read the man page before commenting. Seems the kernel already does this.

So either the author was mistaken and it's not actually an issue, or the kernel doesn't update the read-side loff_t when there is a write failure, in which case the flag would just change this behavior.

How useful should copy_file_range() be?

Posted Feb 18, 2021 16:43 UTC (Thu) by matthias (subscriber, #94967) [Link]

As I understand the man page, the offsets are adjusted by the number of bytes copied. So if you read n bytes and only write m<n bytes, then both offsets are adjusted by m.

How useful should copy_file_range() be?

Posted Feb 18, 2021 16:36 UTC (Thu) by zuzzurro (subscriber, #61118) [Link] (8 responses)

Isn't the problem caused by the fact that our systems have files that pretend to be regular files but don't really behave like them? If that's the case, in the good old days people would create a new file type and flag them as such (p for named pipes, m for multiplexed files, s for sockets), so that userland would not assume that the normal file behaviour was available.
I guess it's too late to do the same anymore...

How useful should copy_file_range() be?

Posted Feb 18, 2021 16:41 UTC (Thu) by zuzzurro (subscriber, #61118) [Link] (6 responses)

What if these "non seekable" files were simply flagged as named pipes?

How useful should copy_file_range() be?

Posted Feb 18, 2021 18:38 UTC (Thu) by smurf (subscriber, #17840) [Link] (5 responses)

Some of these /proc files are quasi-seekable IIRC, in that they return an offset which indicates your read position but which does not correspond to the character count from beginning-of-file to wherever you were when you called lseek(fd, 0, SEEK_CUR).

How useful should copy_file_range() be?

Posted Feb 20, 2021 12:16 UTC (Sat) by jengelh (guest, #33263) [Link] (4 responses)

Looking at the POSIX description for the lseek(2) POSIX-C function, it is required to operate in terms of bytes.

I am not getting that with the /proc/self/sched file (uses `seq_lseek`, which operates in terms of records). Is sys_llseek *meant* to be POSIX-compatible?
If yes, it's a kernel bug.
If no, then it's a libc bug because it failed to provide/emulate POSIX semantics on top of an (unposixy) kernel interface.
So, which is it, where should I file a bug?

How useful should copy_file_range() be?

Posted Feb 20, 2021 14:45 UTC (Sat) by smurf (subscriber, #17840) [Link]

Those fake kernel files don't have POSIX semantics, period. Their reported size doesn't correspond to the number of readable bytes; if they're writable, you can't just write '1' and then '23' when you intend to write '123'; and so on.

So the bug is on the user. If you treat these things like ordinary files and expect all the posicky corner cases to work "correctly", you're SOL. These files will never have POSIX semantics. No, you can't use libc to emulate it. Deal.

Yes there should be a way to ask the kernel whether a file conforms to 100% posix. Well, we don't have that. Deal.

One possible workaround is to check the file size. If it's smaller than pagesize*4 or so then it's probably cheaper to copy its data the old-fashioned way anyway.
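A rough sketch of that size check is shown below. The helper name worth_offloading() and the exact threshold of four pages are assumptions taken from the suggestion above, not an established rule; the check for a regular file is an addition made here for illustration.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical heuristic: only bother with copy_file_range() when the
 * input is a regular file that advertises at least four pages of data.
 * Anything smaller (including files that claim a length of zero) is
 * copied with plain read()/write() instead.
 */
bool worth_offloading(int fd)
{
    struct stat st;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || fstat(fd, &st) < 0)
        return false;
    return S_ISREG(st.st_mode) && st.st_size >= 4 * (off_t)page;
}
```

On a system with 4KB pages the threshold is 16KB, so this would also route sysfs files that claim to be 4,096 bytes long to the manual path. It remains a heuristic: it relies on the advertised size, which is the very thing these files get wrong.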

How useful should copy_file_range() be?

Posted Feb 20, 2021 17:01 UTC (Sat) by markh (subscriber, #33984) [Link] (2 responses)

The bug is that the kernel is reporting it as a regular file (in st_mode), but then does not satisfy the requirements for regular files. If the kernel does not want to satisfy those requirements, all that is needed is to report it as a type of file that does not have the requirements that it cannot satisfy. It is not reasonable to expect userspace programs to somehow guess that a file reported as a regular file cannot be relied upon to behave as such.

How useful should copy_file_range() be?

Posted Feb 23, 2021 19:02 UTC (Tue) by jsmith45 (guest, #125263) [Link] (1 responses)

Very true. The problem is that there are almost certainly a bunch of programs out there that will break if the file type of /proc pseudofiles changes to be anything but a normal file.

How useful should copy_file_range() be?

Posted Feb 25, 2021 11:12 UTC (Thu) by zuzzurro (subscriber, #61118) [Link]

Given the amount of confusion that is visible in the thread where kernel programmers are trying to find the best way to fix this I find it sad that people have tried to just throw the issue off to the userland.

How useful should copy_file_range() be?

Posted Feb 19, 2021 17:33 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

If they are unwilling to flag the filesystem as a whole as FS_GENERATED_CONTENT, I can't imagine they will be eager to flag *each individual file* with something else.

How useful should copy_file_range() be?

Posted Feb 18, 2021 17:31 UTC (Thu) by dezgeg (subscriber, #92243) [Link]

There were objections to the FS_GENERATED_CONTENT flag due to "madness and constant auditing"... but how many such virtual filesystem types that need the flag actually exist, besides the 4 touched in the patchset? Is that really a number that one cannot count with fingers?

How useful should copy_file_range() be?

Posted Feb 18, 2021 18:24 UTC (Thu) by Deewiant (subscriber, #97394) [Link] (12 responses)

Rust's standard library has been using copy_file_range for years. Though apparently a fix for these kinds of issues landed only six months ago: https://github.com/rust-lang/rust/commit/4ddedd521418d67e...

How useful should copy_file_range() be?

Posted Feb 18, 2021 19:21 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (3 responses)

Looks like others can just port the Rust fix.

How useful should copy_file_range() be?

Posted Feb 18, 2021 22:37 UTC (Thu) by drinkcat (subscriber, #106553) [Link] (2 responses)

That workaround should work in most cases. But another tricky thing with copy_file_range is that in case of partial writes, it's supposed to be able to seek in the input file (which is not usually possible on generated files).

How useful should copy_file_range() be?

Posted Feb 18, 2021 23:49 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (1 responses)

Perhaps I'm missing something, but I thought that for the generated files, the call always transfers 0 bytes, and the Rust patch immediately falls back when it sees this. So how is seeking an issue?

How useful should copy_file_range() be?

Posted Feb 19, 2021 0:17 UTC (Fri) by drinkcat (subscriber, #106553) [Link]

Yeah... except for sysfs files that report a size of 4096 bytes; copy_file_range would appear to work on those.

How useful should copy_file_range() be?

Posted Feb 19, 2021 7:14 UTC (Fri) by fw (subscriber, #26023) [Link] (7 responses)

The Go implementation works on file descriptors, not files. Full userspace emulation of copy_file_range is very hard: append-only output files, non-seekable input files that cannot restore the correct input read position after an output failure, other error conditions that are not recoverable because system calls fail during rollback. If it's difficult for the kernel, it's probably hard for userspace, too.

If you can just close the descriptors and report an error (because the function opened them locally), these issues do not apply.

How useful should copy_file_range() be?

Posted Feb 19, 2021 7:32 UTC (Fri) by Deewiant (subscriber, #97394) [Link] (6 responses)

The current code in Rust does seem to handle arbitrary fds too. It's all here, but there's a lot of code to read through (which supports your point that it's difficult): https://github.com/rust-lang/rust/blob/7647d03c33339bd85a...

Looks like it starts by checking the underlying file type (with a stat() if necessary) and only tries copy_file_range on regular files of nonzero size (lines 122 and 168; and 284 and 469 for the logic leading to the stat() itself), while still falling back to other methods if copy_file_range only copied zero bytes (lines 175 and 563). Overall there seems to be a lot of logic around keeping track of what was actually written vs. what was reported by the syscalls.

How useful should copy_file_range() be?

Posted Feb 19, 2021 18:55 UTC (Fri) by fw (subscriber, #26023) [Link]

I see, there is a different code path that reaches the copy_file_range system call.

On the other hand, it is still deeply nested within the library, and it is not immediately obvious whether the callers of the universal copy routines expect consistent file offsets on errors. For a function that is directly modeled on the system call (which seems to be the case for Go), predictable file offset behavior seems quite important.

How useful should copy_file_range() be?

Posted Feb 19, 2021 20:29 UTC (Fri) by the8472 (guest, #144969) [Link] (4 responses)

You're looking at the linux implementation for io::copy which can copy arbitrary readers to writers but specializes to various syscalls when types wrapping file descriptors of unknown type are passed, hence the complexity.

There also is fs::copy which shares some code but has fewer cases to cover and does fewer checks before invoking copy_file_range since it's only meant to copy entire regular files.

The relevant parts are here and here. Note that it doesn't use stat information to decide what to do in the fs::copy case, it just tries copy_file_range and then falls back in various cases, including 0 byte reads.

How useful should copy_file_range() be?

Posted Jul 10, 2021 12:57 UTC (Sat) by jaykrell (guest, #153190) [Link] (3 responses)

The bug is that /proc exists at all.
This data should be retrieved through strongly typed special purpose function calls.
Or at least, perhaps, allow open, for a hierarchical namespace, but not read, and reveal all information via ioctl.

Not everything is a file!

In fact, most things are not a file.
Is the screen a file?
Is the keyboard a file?
Are sockets files? They have read/write, but how about seek and mmap?

How useful should copy_file_range() be?

Posted Jul 10, 2021 13:54 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

Those things are not files, but allowing userspace to treat them as file-like for the simplest common use case (sequential I/O) is one of the things that contributed to Unix eating most of the rest of the server operating system industry's breakfast, lunch, dinner, and face.

How useful should copy_file_range() be?

Posted Jul 10, 2021 21:48 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> Or at least, perhaps, allow open, for a hierarchical namespace, but not read, and reveal all information via ioctl.
Windows NT tried that. It failed miserably.

How useful should copy_file_range() be?

Posted Jul 10, 2021 23:16 UTC (Sat) by flussence (guest, #85566) [Link]

>Is the screen a file?
>Is the keyboard a file?

You're being facetious, but it's occasionally very useful to be able to do things like check which port a monitor is plugged in on or framedump the console on a server using only ssh.