The Linux "copy problem"

By Jake Edge
May 29, 2019

In a filesystem session on the third day of the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Steve French wanted to talk about copy operations. Much of the development work that has gone on in the Linux filesystem world over the last few years has been related to the performance of copying files, at least indirectly, he said. There are still pain points around copy operations, however, so he would like to see those get addressed.

The "copy problem" is something that has been discussed at LSFMM before, French said, but things have gotten better over the last year due to efforts by Oracle, Amazon, Microsoft, and others. Things are also changing for copy operations; many of them are done to and from the cloud, which has to deal with a wide variation in network latency. At the other end, NVMe is making larger storage faster at a relatively affordable price. Meanwhile virtualization is taking more CPU, at times, because operations that might have been offloaded to network hardware are being handled by the CPU.

But copying files is one of the simplest, most intuitive operations for users; people do it all the time. He made multiple copies of his presentation slides in various locations, for example. Some of the most common utilities used are rsync, which is part of the Samba tools, scp from OpenSSH, and cp from the coreutils.

The source code for cp is "almost embarrassingly small" at around 4K lines of code; scp is about the same and rsync is somewhat larger. They each have to deal with some corner cases as well. He showed some examples of the amount of time it takes to copy files on Btrfs and ext4 on two different drives attached to his laptop, one faster and one slower. On the slow drive with Btrfs, scp took almost five times as long as cp for a 1GB copy. On the fast drive, for a 2GB copy on ext4, cp took 1.2s (1.7s on the slow), scp 1.7s (8.4s), and rsync took 4.3s (not run on the slow drive, apparently). These represent "a dramatic difference in performance" for a "really stupid" copy operation

The I/O size for cp is 128K and the others use 16K, which explains some of the difference, he said. These copies are all going through the page cache, which is kind of odd because you don't normally need the data you just copied again. None of the utilities uses O_DIRECT, if they did there would an improvement in the performance of a few percent, he said. Larger I/O sizes would also improve things.

There are alternative tools that make various changes to improve performance. For example, parcp and parallel parallelize copy operations. In addition, fpart and fpsync can parallelize the operations that copy directories. Beyond that, Mutil is a parallel copy that is based on the cp and md5sum code from coreutils; it comes out of a ten-year old paper [PDF] covering some work that NASA did on analyzing copy performance because the agency found Linux cp to be lacking. The code never went upstream, however, so it can't even be built at this point, French said.

Cluster environments and network filesystems would rather have the server handle the copies directly using copy offload. Cloud providers would presumably prefer to have their backends handle copy operations rather than have them done directly from clients, he said. Also parallelization is a need because the common tools overuse one processor rather than spreading the load, especially if you are doing any encryption. In addition, local cross-mount copies are not being done efficiently; he believes that Linux filesystems could do a much better job in the kernel than cp does in user space even if they were copying between two different mounted filesystems of the same type.

Luis Chamberlain asked if French had spoken about these issues at conferences that had more of a user-space focus. The problems are real, but it is not up to kernel developers to fix them, he said. In addition, any change to parallelize, say, cp would need to continue to operate serially in the default case for backward compatibility. In the end, these are user-space problems, Chamberlain said.

In the vast majority of cases for the open-source developers of these tools, it is the I/O device that is the bottleneck, Ted Ts'o said. If you have a 500-disk RAID array, parallel cp makes a lot of sense, but the coreutils developers are not using those kinds of machines. Similarly, "not everyone is bottlenecked on the network"; those who are will want copy offload. More progress will be made by targeting specific use cases, rather than some generic "copy problem", since there are many different kinds of bottlenecks at play here.

French said that he strongly agrees with that. The problem is that when people run into problems with copy performance on SMB or NFS, they contact him. Other types of problems lead users to contact developers of other kernel filesystems. For example, he said he was tempted to track down a Btrfs developer when he was running some of his tests that took an inordinate amount of time on that filesystem.

Chris Mason said that if there are dramatically different results from local copies on the same device using the standard utilities, it probably points to some kind of bug in buffered I/O. The I/O size should not make a huge difference as the readahead in the kernel should keep the performance roughly the same. French agreed but said that the copy problem in Linux is something that is discussed in multiple places. For example, Amazon has a day-long tutorial on copying for Linux, he said; "it's crazy". This is a big deal for many, "and not just local filesystems and not just clusters".

There are different use cases, some are trying to minimize network bandwidth, others are trying to reduce CPU use, still others have clusters that have problems with metadata updates. The good news is that all of these problems have been solved, he said, but the bad news is that the developers of cp, parcp, and others do not have the knowledge that the filesystems developers have, so they need advice.

Though there are some places "where our APIs are badly broken", he said. For example, when opening a file and setting a bunch of attributes, such as access control lists (ACLs), there are races because those operations cannot be done atomically. That opens security holes.

There are some things he thinks the filesystems could do. For example, Btrfs could support copy_file_range(); there are cases where Btrfs knows how to copy faster and, if it doesn't, user space can fall back to what it does today. There are five or so filesystems in the kernel that support copy_file_range() and Btrfs could do a better job with copies if this copy API is invoked; Btrfs knows more about the placement of the data and what I/O sizes to use.

Metadata is another problem area, French said. The race on setting ACLs is one aspect of that. Another is that filesystem-specific metadata may not get copied as part of a copy operation, such as file attributes and specialized ACLs. There is no API for user space to call that knows how to copy everything about a file; Windows and macOS have that, though it is not the default.

Ts'o said that shows a need for a library that provides ways to copy ACLs, extended attributes (xattrs), and the like. Application developers have told him that they truncate and rewrite files because "they are too lazy to copy the ACLs and xattrs", but then complain when they lose their data if the machine crashes. The solution is not a kernel API, he said.

But French is concerned that some of the xattrs have security implications (e.g. for SELinux), so he thinks the filesystems should be involved in copying them. Ts'o said that doing so in the kernel actually complicates the problem; SELinux is set up to make decisions about what the attributes should be from user space, doing it in the kernel is the wrong place. Mason agreed, saying there is a long history with the API for security attributes; he is "not thrilled" with the idea of redoing that work. He does think that there should be a way to create files with all of their attributes atomically, however.

There was more discussion of ways to possibly change the user-space tools, but several asked for specific ideas of what interfaces the kernel should be providing to help. French said that one example would be to provide a way to get the recommended I/O size for a file. Right now, the utilities base their I/O size on the inode block size reported for the file; SMB and NFS lie and say it is 1MB to improve performance.

But Mason said that the right I/O size depends on the device. Ts'o said that the st_blksize returned from stat() is the preferred I/O size according to POSIX; "whatever the hell that means". Right now, the filesystem block size is returned in that field and there are probably applications that use it for that purpose so a new interface is likely needed to get the optimal I/O size; that could perhaps be added to statx(). But if a filesystem is on a RAID device, for example, it would need to interact with the RAID controller to try to figure out the best I/O size; often the devices will not provide enough information so the filesystem has to guess and will get it wrong sometimes. That means there will need to be a way to override that value via sysfs.

Another idea would be to give user space a way to figure out if it makes sense to turn off the page cache, French said. But that depends on what is going to be done with the file after the copy, Mason said; if you are going to build a kernel with the copied files, then you want that data in the page cache. It is not a decision that the kernel can help with.

The large list of copy tools with different strategies is actually a good motivation not to change what the kernel does, Mason said. User space is the right place to have different policies for how to do a copy operation. French said that 90% of the complaints he hears are about cp performance. Several in the discussion suggested that volunteers or interns be found to go fix cp and make it smarter, but French would like to see the filesystem developers participate in developing the tools or at least advising those developers. Mason pointed out that kernel developers are not ambassadors to go fix applications across the open-source world, however; "our job is to build the interfaces", so that is where the focus of the discussion should be.

As the session closed, French said that Linux copy performance was a bit embarrassing; OS/2 was probably better for copy performance in some ways. But he did note that the way sparse files were handled using FIEMAP in cp was great. Ts'o pointed out that FIEMAP is a great example of how the process should work. Someone identified a problem, so kernel developers added a new feature to help fix it, and now that code is in cp; that is what should be happening with any other kernel features needed for copy operations.

Index entries for this article
Kernel	Block layer
Kernel	System calls/copy_file_range()
Conference	Storage, Filesystem, and Memory-Management Summit/2019

The Linux "copy problem"

Posted May 29, 2019 19:57 UTC (Wed) by desbma (guest, #118820) [Link] (5 responses)

No mention of the sendfile system call?
It does not work for every case, but when it does, it is much more efficient than doing chunk based copy, and it solves the "what is the optimal I/O size" dilemma.

The Linux "copy problem"

Posted May 29, 2019 21:38 UTC (Wed) by ewen (subscriber, #4772) [Link]

The sendfile system call was my thought too. If a copy tool used sendfile, eg, for within a file system copying then the kernel has lots more information on the high level goal with which to optimise the approach taken. And even between file systems it could potentially make use of file system internal knowledge (block layout, etc) and buffer access.

If copy tools just do read/write in a loop, the kernel is left guessing the high level intent (including read ahead and whether to cache it in the kernel buffers),

Ewen

The Linux "copy problem"

Posted May 30, 2019 13:22 UTC (Thu) by ecree (guest, #95790) [Link] (3 responses)

I was wondering about splice(). Wouldn't that be a more efficient way to do copies?

The Linux "copy problem"

Posted May 30, 2019 13:38 UTC (Thu) by desbma (guest, #118820) [Link] (2 responses)

According to `man splice`:
> splice() moves data between two file descriptors [...] where one of the file descriptors must refer to a pipe

sendfile does not have this restriction (it can also work if the destination fd is a socket), so it is the ideal candidate for copying files.

Apparently, someone proposed to update coreutil's cp to use sendfile in 2012, but it was rejected: https://lists.gnu.org/archive/html/coreutils/2012-10/msg0...

A while ago I did some benchmarks in Python to compare "read/write chunk" vs "sendfile" based copy, and it let to a 30-50% speedup : https://github.com/desbma/pyfastcopy#performance

The Linux "copy problem"

Posted May 30, 2019 13:54 UTC (Thu) by ecree (guest, #95790) [Link] (1 responses)

> > one of the file descriptors must refer to a pipe
Yes, so you make two splice calls:

int p[2];
pipe(p);
splice(fd_in, NULL, p[1], NULL, len, flags);
splice(p[0], NULL, fd_out, NULL, len, flags);

I haven't actually tried this, but in theory it should enable the kernel to do a zero-copy copy where the underlying files support that. The pipe is really no more than a way to associate a userspace handle with a kernel buffer; see https://yarchive.net/comp/linux/splice.html for details.

The Linux "copy problem"

Posted May 30, 2019 14:00 UTC (Thu) by desbma (guest, #118820) [Link]

> I haven't actually tried this, but in theory it should enable the kernel to do a zero-copy copy where the underlying files support that.

This is exactly what sendfile does, with a single system call, instead of 3 for your example.

The Linux "copy problem"

Posted May 29, 2019 20:20 UTC (Wed) by roc (subscriber, #30627) [Link] (15 responses)

> Mason pointed out that kernel developers are not ambassadors to go fix applications across the open-source world, however; "our job is to build the interfaces", so that is where the focus of the discussion should be.

It seems like it would be worth at least diving into `cp` and making sure it's as good as can be, since that gets used so much. Then you can point developers of those other applications at `cp` as an example of how to do things right.

The Linux "copy problem"

Posted May 29, 2019 21:05 UTC (Wed) by smfrench (subscriber, #124116) [Link] (7 responses)

In the presentation I listed seven options that could be added (e.g. to cp and rsync). Other copy tools (like robocopy for Windows) have these (as well as others that may be less important for us on Linux) and may be useful examples.

For example some options which other tools like robocopy let the user select:
- parallel i/o (especially for the uncached copy case)
- allow setting file size first (to reduce the number of metadata updates during the copy operation)
- allow calling the copy system call (copy_file_range API) for file systems which support it
- allow copying additional metadata (e..g xattr and ACLs)
- allow choosing larger i/o (overriding the block size). For some filesystems i/o > 1MB can be much faster than small I/O (some tools will default to 4K or smaller which can be more than 10 times slower)

And then following up on other discussions at the sumimt:
- allow options like encryption or compression (which could be supported over SMB3 for example and probably other filesystems).

The Linux "copy problem"

Posted May 29, 2019 23:05 UTC (Wed) by roc (subscriber, #30627) [Link]

That makes sense, but you also want to the default to be as good as can be.

The Linux "copy problem"

Posted May 30, 2019 16:20 UTC (Thu) by boutcher (subscriber, #7730) [Link]

I had to laugh that you brought up OS/2

The Linux "copy problem"

Posted Jun 1, 2019 1:40 UTC (Sat) by tarkasteve (subscriber, #94934) [Link] (4 responses)

I'd also humbly suggest `xcp`:

https://crates.io/crates/xcp

* Uses copy_file_range() where possible, falls back to userspace if not.
* Supports sparse files (with lseek; I wasn't aware of fiemap, is there any advantage to one over the other?)
* Partially parallel (recursive read is separate from copy operations; I have an todo for parallel copy as it seems to have advantages on nvme drives).
* Optional progress bar.
* Written in Rust
* Cross platform (well, Linux + other unix-like OSs; Windows may work, I've never managed to get Rust to work on it).

It doesn't support much in the way of permissions/ACLs ATM, it's still an intermittent WIP.

I did look at using O_DIRECT, but I get EINVAL. The open manpage lists a whole series of caveats and warnings about using it, including a disparaging quote from Linus.

Thanks for the discussion/article, it's given me some things to look into.

The Linux "copy problem"

Posted Jun 1, 2019 13:04 UTC (Sat) by desbma (guest, #118820) [Link] (1 responses)

Thanks for the link.

It joins the list of great little tools that have taken inspiration from classic Unix command line tools, but rewritten them in Rust with many improvements along the way: grep -> ripgrep, find -> fd, hexdump -> hexyl, cat -> bat, du -> diskus, cloc -> tokei...

I'll be sure to look into xcp, and probably open a few issues along the way :)

The Linux "copy problem"

Posted Jun 2, 2019 3:02 UTC (Sun) by scientes (guest, #83068) [Link]

I myself was using inotail until I reported the problem (tail -f didn't support inotify) to coreutils and it was actually fixed.

The Linux "copy problem"

Posted Jun 2, 2019 5:12 UTC (Sun) by tarkasteve (subscriber, #94934) [Link] (1 responses)

So inspired by all this, I've updated xcp with the ability to do parallel copies (at the per-file level). The results are fairly good; I'm seeing 30%-60% speed-ups depending on caching.

The Linux "copy problem"

Posted Jun 10, 2019 21:58 UTC (Mon) by smfrench (subscriber, #124116) [Link]

This is great news - looking forward to trying it. Am also very excited about the work Andreas at RedHat did, enabling GCM crypto for SMB3.1.1 mounts, which can more than double performance copying files to server when on encrypted mounts (in conjunction with two cifs.ko client patches that I recently merged into for-next that enable GCM on the client).

The Linux "copy problem"

Posted May 30, 2019 7:29 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

I had a range of responses to this 'build the interfaces' thing.

How people use those interfaces is something that needs guidance, always: you form a symbiotic loop in reliance on each other.

Conway's Law hasn't gone away and is implications about interacting with the users of your interfaces still stand. I think that Chris Mason's comment here is misguided and that we need to help people work together and communicate better. I think we need to balance this view of APIs with criticism of xattrs being difficult to copy and racy (with security implications).

K3n.

The Linux "copy problem"

Posted May 30, 2019 16:00 UTC (Thu) by KAMiKAZOW (guest, #107958) [Link] (5 responses)

> It seems like it would be worth at least diving into `cp` and making sure it's as good as can be, since that gets used so much.

I find it insane that cp was fixed a decade ago by NASA and in all that time neither them nor anybody else thought about upstreaming the changes.

The Linux "copy problem"

Posted May 30, 2019 16:17 UTC (Thu) by desbma (guest, #118820) [Link] (4 responses)

Unfortunately, there are many other example of cases like this.

For example zlib, one the most widely used software library in the world, has several forks (Intel, Cloudflare, zlib-ng...) with optimizations that improve compression/decompression speed.

Yet the changes have never been merged back in zlib, and everybody still uses the historic version, and happily wastes CPU cycles (including when your browser decompresses this very page).

The Linux "copy problem"

Posted Jun 2, 2019 3:07 UTC (Sun) by scientes (guest, #83068) [Link] (3 responses)

> (including when your browser decompresses this very page).

Compression is disabled for https sites due to various attacks on the file size information leak.

The Linux "copy problem"

Posted Jun 2, 2019 10:27 UTC (Sun) by desbma (guest, #118820) [Link] (2 responses)

My browser (and curl) both disagree with you:
curl -v --compressed 'https://lwn.net/' > /dev/null 2>&1 | grep gzip
> Accept-Encoding: deflate, gzip
< Content-Encoding: gzip

The Linux "copy problem"

Posted Jun 2, 2019 12:20 UTC (Sun) by Jandar (subscriber, #85683) [Link] (1 responses)

> curl -v --compressed 'https://lwn.net/' > /dev/null 2>&1 | grep gzip

This command is obviously without any output.

$ curl -v --compressed 'https://lwn.net/' > /dev/null 2>&1 | wc
0 0 0

Perhaps you meant: curl -v --compressed 'https://lwn.net/' 2>&1 > /dev/null | grep gzip

The Linux "copy problem"

Posted Jun 2, 2019 12:45 UTC (Sun) by desbma (guest, #118820) [Link]

You are right, I'm using ZSH and didn't realize that line was not portable across other shells.

curl -v --compressed 'https://lwn.net/' -o /dev/null 2>&1 | grep gzip

also works

The Linux "copy problem"

Posted May 30, 2019 10:35 UTC (Thu) by jezuch (subscriber, #52988) [Link] (6 responses)

I don't cp often but when I do I cp --reflink

The Linux "copy problem"

Posted May 31, 2019 9:33 UTC (Fri) by LtWorf (subscriber, #124958) [Link] (2 responses)

Some of us don't use experimental filesystems.

The Linux "copy problem"

Posted May 31, 2019 14:44 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (1 responses)

Doesn't XFS has reflink() support?

The Linux "copy problem"

Posted May 31, 2019 15:46 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

Yes. Enabled by default in recent Fedora and RHEL 8 as well.

The Linux "copy problem"

Posted Jun 16, 2019 17:40 UTC (Sun) by nevyn (guest, #33129) [Link] (2 responses)

Why is this not the default?

The Linux "copy problem"

Posted Jun 20, 2019 6:51 UTC (Thu) by jezuch (subscriber, #52988) [Link] (1 responses)

I think people were concerned with loss of redundancy. Some claimed that they make a copy because they want to *really* have an actual copy and making it COW by default is surprising to the user by basically changing the contract of the command.

That's how I remember it at least.

Anyway, when I copy something it's mostly because I screwed up and I am restoring things from a btrfs backup snapshot. Making a real copy would cause unnecessary duplication in the backup - and would be unnecessarily slow and drive-thrashing. But that's just my use case.

The Linux "copy problem"

Posted Jun 24, 2019 14:28 UTC (Mon) by nix (subscriber, #2304) [Link]

Reflinking does, inevitably, reduce the fragmentation-resistance of the filesystem if you use it a lot and tend to keep the post-reflinked files for longer than the pre-reflinked ones. (That's the whole point, after all: the new blocks are necessarily going to be fragmented wrt the old ones, even if this is reduced by writing more than strictly needed in the CoW phase.)

Ideally, a background preen phase would rewrite such things after a while so they are non-fragmented again. I guess one could call xfs_fsr from cron to do that... on an SSD this matters not at all, of course, and not very much if there's a caching layer like LVM's or bcache in the way either. But it's a real concern on unintermediated spinning rust.

The copy problem is really the backup problem

Posted May 30, 2019 15:01 UTC (Thu) by mcr (subscriber, #99374) [Link] (3 responses)

In the old says of Unix, with a single file system, we used "dump" to get a good copy. It had all sorts ridiculous issues of having a userspace program trying to decipher file system contents from raw reads of the disk. On the other hand, when it worked, it got all the metadata, did it without destroying the buffer cache, and often was able to backup disks which were in the process of dying. When it failed, it failed, and the backups were sometimes garbage. And it didn't work for many things. So people mostly use tar for backup. And that's should be the most common copy problem, which is not just about data centers or cluster environments. And tar fails for any file system that does something innovative.
My claim is that our VFS layer is incomplete: it should include an atomic backup and an atomic restore operation, at least on a file level, but optionally on a directory basis. If we had that, then cp would always usefully be backup file | restore file2. This means that file systems have to serialize file contents and meta data, and have to deserialize it too. We Linux a microkernel architecture, then probably much of this deserialization could be done in some system-provided, non-ring0 context. Should we pick tar for serialization, or something more modern like CBOR, that's a bike shed for a design team.
I would just be happy if we could agree that we need this functionality.

The copy problem is really the backup problem

Posted Jun 4, 2019 9:18 UTC (Tue) by jezuch (subscriber, #52988) [Link] (2 responses)

At least on btrfs that's:

btrfs subvolume snapshot -r
btrfs send
btrfs receive

But it does not work on per-file basis, unfortunately. And yes, btrfs defines its own serialization format.

The copy problem is really the backup problem

Posted Jun 4, 2019 14:06 UTC (Tue) by mcr (subscriber, #99374) [Link] (1 responses)

What happens if a file is open (TXTBUSY)? or open O_DIRECT, or any of these other things that might be mutually exclusive with regular I/O?

The copy problem is really the backup problem

Posted Jun 19, 2019 21:32 UTC (Wed) by nix (subscriber, #2304) [Link]

btrfs send and receive are not regular I/O, so they work fine. (Though I'm not sure what happens in conjunction with O_DIRECT, which is a bit... hard to grasp the semantics of on a CoW filesystem in any case.)

(You don't get -ETXTBSY if you read a file in any case, only if you try to modify it.)

The Linux "copy problem"

Posted May 30, 2019 18:56 UTC (Thu) by jmgao (guest, #104246) [Link]

> He does think that there should be a way to create files with all of their attributes atomically, however.

Isn't there already a solution to this? You can open a file on the filesystem you want to create the file on with O_TMPFILE to hide it until you've done all of your attribute twiddling, do your fsetfilecon, fsetxattr, etc. on the file descriptor, and then use linkat to put it into place.

The Linux "copy problem"

Posted Jun 1, 2019 16:11 UTC (Sat) by jmclnx (guest, #72456) [Link] (1 responses)

Speeding up things is always good, but the one thing I have noticed in the last few years. Once in a great while, if I cp a large file from one local fs to another local fs, the whole system would freeze for multiple 10s of seconds when a sync happens.

In the early days I never saw this and we use to make fun of Microsoft for having that freeze issue. I would rather have a slower cp and not have that sync freeze :)

The Linux "copy problem"

Posted Jun 1, 2019 21:13 UTC (Sat) by desbma (guest, #118820) [Link]

This is most likely the infamous "dirty writeback" problem, best described in this LWN article: https://lwn.net/Articles/572911/

On my systems, I fix it permanently with:
echo "vm.dirty_background_bytes=$((48 * 1024 * 1024))
vm.dirty_ratio=10" | sudo tee /etc/sysctl.d/99-dirty-writeback.conf
sudo sysctl --system

That way the writeback kicks in with at most 48MB of dirty data, and also stalls the writer process before there is more than 10% of all memory consumed by dirty writeback pages.

The Linux "copy problem"

Posted Feb 1, 2021 13:18 UTC (Mon) by oliwer (subscriber, #40989) [Link]

Note that GNU cp(1) recently added support for copy_file_range(2).
https://git.savannah.gnu.org/cgit/coreutils.git/commit/sr...