The Linux "copy problem"
In a filesystem session on the third day of the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Steve French wanted to talk about copy operations. Much of the development work that has gone on in the Linux filesystem world over the last few years has been related to the performance of copying files, at least indirectly, he said. There are still pain points around copy operations, however, so he would like to see those get addressed.
The "copy problem" is something that has been discussed at LSFMM before, French said, but things have gotten better over the last year due to efforts by Oracle, Amazon, Microsoft, and others. Things are also changing for copy operations; many of them are done to and from the cloud, which has to deal with a wide variation in network latency. At the other end, NVMe is making larger storage faster at a relatively affordable price. Meanwhile virtualization is taking more CPU, at times, because operations that might have been offloaded to network hardware are being handled by the CPU.
![Steve French](https://static.lwn.net/images/2019/lsf-french-sm.jpg)
But copying files is one of the simplest, most intuitive operations for users; people do it all the time. He made multiple copies of his presentation slides in various locations, for example. Some of the most common utilities used are rsync, which is part of the Samba tools, scp from OpenSSH, and cp from the coreutils.
The source code for cp is "almost embarrassingly small" at around 4K lines of code; scp is about the same and rsync is somewhat larger. They each have to deal with some corner cases as well. He showed some examples of the amount of time it takes to copy files on Btrfs and ext4 on two different drives attached to his laptop, one faster and one slower. On the slow drive with Btrfs, scp took almost five times as long as cp for a 1GB copy. On the fast drive, for a 2GB copy on ext4, cp took 1.2s (1.7s on the slow drive), scp 1.7s (8.4s), and rsync took 4.3s (not run on the slow drive, apparently). These represent "a dramatic difference in performance" for a "really stupid" copy operation.
The I/O size for cp is 128K and the others use 16K, which explains some of the difference, he said. These copies are all going through the page cache, which is kind of odd because you don't normally need the data you just copied again. None of the utilities uses O_DIRECT; if they did, there would be an improvement in performance of a few percent, he said. Larger I/O sizes would also improve things.
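As a rough illustration of those two points (a sketch, not code shown in the session), a copy loop that uses a much larger I/O size and bypasses the page cache with O_DIRECT might look like the following; the buffer size and alignment are illustrative, and a real tool would have to fall back to buffered I/O for the unaligned tail of the file:

    /* Sketch: large-block copy loop that bypasses the page cache.
     * O_DIRECT requires the buffer, file offset, and I/O size to be
     * aligned to the device's logical block size; error handling and
     * the unaligned final chunk are not handled here. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUF_SIZE (4 * 1024 * 1024)    /* 4MB, versus cp's 128K */

    int copy_direct(const char *src, const char *dst)
    {
        void *buf;
        ssize_t n;
        int in = open(src, O_RDONLY | O_DIRECT);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

        if (in < 0 || out < 0 || posix_memalign(&buf, 4096, BUF_SIZE))
            return -1;

        while ((n = read(in, buf, BUF_SIZE)) > 0)
            if (write(out, buf, n) != n)
                return -1;

        free(buf);
        close(out);
        close(in);
        return n < 0 ? -1 : 0;
    }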
There are alternative tools that make various changes to improve performance. For example, parcp and parallel parallelize copy operations. In addition, fpart and fpsync can parallelize the operations that copy directories. Beyond that, Mutil is a parallel copy that is based on the cp and md5sum code from coreutils; it comes out of a ten-year-old paper [PDF] covering some work that NASA did on analyzing copy performance because the agency found Linux cp to be lacking. The code never went upstream, however, so it can't even be built at this point, French said.
Cluster environments and network filesystems would rather have the server handle the copies directly using copy offload. Cloud providers would presumably prefer to have their backends handle copy operations rather than have them done directly from clients, he said. Parallelization is also needed because the common tools overload one processor rather than spreading the work around, especially if any encryption is being done. In addition, local cross-mount copies are not being done efficiently; he believes that Linux filesystems could do a much better job in the kernel than cp does in user space, even if they were copying between two different mounted filesystems of the same type.
Luis Chamberlain asked if French had spoken about these issues at conferences that had more of a user-space focus. The problems are real, but it is not up to kernel developers to fix them, he said. In addition, any change to parallelize, say, cp would need to continue to operate serially in the default case for backward compatibility. In the end, these are user-space problems, Chamberlain said.
In the vast majority of cases for the open-source developers of these tools, it is the I/O device that is the bottleneck, Ted Ts'o said. If you have a 500-disk RAID array, parallel cp makes a lot of sense, but the coreutils developers are not using those kinds of machines. Similarly, "not everyone is bottlenecked on the network"; those who are will want copy offload. More progress will be made by targeting specific use cases, rather than some generic "copy problem", since there are many different kinds of bottlenecks at play here.
French said that he strongly agrees with that. The problem is that when people run into problems with copy performance on SMB or NFS, they contact him. Other types of problems lead users to contact developers of other kernel filesystems. For example, he said he was tempted to track down a Btrfs developer when he was running some of his tests that took an inordinate amount of time on that filesystem.
Chris Mason said that if there are dramatically different results from local copies on the same device using the standard utilities, it probably points to some kind of bug in buffered I/O. The I/O size should not make a huge difference as the readahead in the kernel should keep the performance roughly the same. French agreed but said that the copy problem in Linux is something that is discussed in multiple places. For example, Amazon has a day-long tutorial on copying for Linux, he said; "it's crazy". This is a big deal for many, "and not just local filesystems and not just clusters".
There are different use cases, some are trying to minimize network bandwidth, others are trying to reduce CPU use, still others have clusters that have problems with metadata updates. The good news is that all of these problems have been solved, he said, but the bad news is that the developers of cp, parcp, and others do not have the knowledge that the filesystems developers have, so they need advice.
There are some places "where our APIs are badly broken", though, he said. For example, when opening a file and setting a bunch of attributes, such as access control lists (ACLs), there are races because those operations cannot be done atomically. That opens security holes.
There are some things he thinks the filesystems could do. For example, Btrfs could support copy_file_range(); there are cases where Btrfs knows how to copy faster and, if it doesn't, user space can fall back to what it does today. There are five or so filesystems in the kernel that support copy_file_range() and Btrfs could do a better job with copies if this copy API is invoked; Btrfs knows more about the placement of the data and what I/O sizes to use.
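The pattern he is describing — try copy_file_range() and fall back to an ordinary read/write loop when the filesystem or kernel does not support it — would look roughly like this sketch, with error handling abbreviated:

    /* Sketch: let the filesystem (or, with copy offload, the server) do
     * the copy via copy_file_range(); fall back to a plain read/write
     * loop if it is not supported for this pair of files. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int copy_fd(int in, int out)
    {
        char buf[128 * 1024];
        struct stat st;
        off_t left;
        ssize_t n;

        if (fstat(in, &st) < 0)
            return -1;
        left = st.st_size;

        while (left > 0) {
            n = copy_file_range(in, NULL, out, NULL, left, 0);
            if (n < 0) {
                if (errno == EOPNOTSUPP || errno == EXDEV || errno == ENOSYS)
                    break;        /* fall back to read/write below */
                return -1;
            }
            if (n == 0)
                break;
            left -= n;
        }

        while (left > 0) {        /* generic userspace fallback */
            n = read(in, buf, sizeof(buf));
            if (n <= 0)
                return n < 0 ? -1 : 0;
            if (write(out, buf, n) != n)
                return -1;
            left -= n;
        }
        return 0;
    }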
Metadata is another problem area, French said. The race on setting ACLs is one aspect of that. Another is that filesystem-specific metadata may not get copied as part of a copy operation, such as file attributes and specialized ACLs. There is no API for user space to call that knows how to copy everything about a file; Windows and macOS have that, though it is not the default.
Ts'o said that shows a need for a library that provides ways to copy ACLs, extended attributes (xattrs), and the like. Application developers have told him that they truncate and rewrite files because "they are too lazy to copy the ACLs and xattrs", but then complain when they lose their data if the machine crashes. The solution is not a kernel API, he said.
But French is concerned that some of the xattrs have security implications (e.g. for SELinux), so he thinks the filesystems should be involved in copying them. Ts'o said that doing so in the kernel actually complicates the problem; SELinux is set up to make decisions about what the attributes should be from user space, doing it in the kernel is the wrong place. Mason agreed, saying there is a long history with the API for security attributes; he is "not thrilled" with the idea of redoing that work. He does think that there should be a way to create files with all of their attributes atomically, however.
There was more discussion of ways to possibly change the user-space tools, but several asked for specific ideas of what interfaces the kernel should be providing to help. French said that one example would be to provide a way to get the recommended I/O size for a file. Right now, the utilities base their I/O size on the inode block size reported for the file; SMB and NFS lie and say it is 1MB to improve performance.
But Mason said that the right I/O size depends on the device. Ts'o said that the st_blksize returned from stat() is the preferred I/O size according to POSIX; "whatever the hell that means". Right now, the filesystem block size is returned in that field and there are probably applications that use it for that purpose so a new interface is likely needed to get the optimal I/O size; that could perhaps be added to statx(). But if a filesystem is on a RAID device, for example, it would need to interact with the RAID controller to try to figure out the best I/O size; often the devices will not provide enough information so the filesystem has to guess and will get it wrong sometimes. That means there will need to be a way to override that value via sysfs.
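For reference, the value the utilities consult today is st_blksize from stat(); a dedicated "optimal copy I/O size" field, whether in statx() or elsewhere, does not currently exist. A minimal look at where the current number comes from:

    /* Print the "preferred I/O size" that stat() reports; this is the
     * st_blksize value that cp and friends use to size their I/O, and
     * that NFS and SMB inflate (to 1MB, for example). */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;

        if (argc < 2 || stat(argv[1], &st) < 0) {
            perror("stat");
            return 1;
        }
        printf("preferred I/O size (st_blksize): %ld bytes\n",
               (long)st.st_blksize);
        return 0;
    }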
Another idea would be to give user space a way to figure out if it makes sense to turn off the page cache, French said. But that depends on what is going to be done with the file after the copy, Mason said; if you are going to build a kernel with the copied files, then you want that data in the page cache. It is not a decision that the kernel can help with.
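User space does already have one blunt instrument for this: posix_fadvise(). A hedged sketch of dropping a just-copied range from the page cache follows; whether calling it is a good idea depends, as Mason says, entirely on what happens to the file next:

    /* Sketch: advise the kernel that a copied range will not be needed
     * again. The pages must have been written back (e.g. after fsync()
     * or sync_file_range()) before they can actually be dropped, and
     * the call is only advisory. */
    #include <fcntl.h>

    static void drop_cached_range(int fd, off_t offset, off_t len)
    {
        posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }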
The large list of copy tools with different strategies is actually a good motivation not to change what the kernel does, Mason said. User space is the right place to have different policies for how to do a copy operation. French said that 90% of the complaints he hears are about cp performance. Several in the discussion suggested that volunteers or interns be found to go fix cp and make it smarter, but French would like to see the filesystem developers participate in developing the tools or at least advising those developers. Mason pointed out that kernel developers are not ambassadors to go fix applications across the open-source world, however; "our job is to build the interfaces", so that is where the focus of the discussion should be.
As the session closed, French said that Linux copy performance was a bit embarrassing; OS/2 was probably better for copy performance in some ways. But he did note that the way sparse files were handled using FIEMAP in cp was great. Ts'o pointed out that FIEMAP is a great example of how the process should work. Someone identified a problem, so kernel developers added a new feature to help fix it, and now that code is in cp; that is what should be happening with any other kernel features needed for copy operations.
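For illustration, the hole-skipping that cp does with FIEMAP can also be expressed with lseek()'s SEEK_DATA/SEEK_HOLE interface (the approach one of the comments below mentions); a sketch that walks only the data extents of a sparse source file:

    /* Sketch: iterate over the data extents of a sparse file using
     * SEEK_DATA/SEEK_HOLE and hand each one to a copy callback, leaving
     * the holes sparse in the destination. A real tool would fall back
     * to a plain copy if lseek() fails with EINVAL (interface unsupported). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int for_each_data_extent(int fd, off_t size,
                             int (*copy)(int fd, off_t start, off_t end))
    {
        off_t data = 0, hole;

        while (data < size) {
            data = lseek(fd, data, SEEK_DATA);
            if (data < 0)
                return 0;        /* no more data past this point */
            hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0)
                return -1;
            if (copy(fd, data, hole) < 0)
                return -1;
            data = hole;
        }
        return 0;
    }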
| Index entries for this article | |
|---|---|
| Kernel | Block layer |
| Kernel | System calls/copy_file_range() |
| Conference | Storage, Filesystem, and Memory-Management Summit/2019 |

The Linux "copy problem"
Posted May 29, 2019 19:57 UTC (Wed) by desbma (guest, #118820) [Link] (5 responses)

It does not work for every case, but when it does, it is much more efficient than doing chunk based copy, and it solves the "what is the optimal I/O size" dilemma.

The Linux "copy problem"
Posted May 29, 2019 21:38 UTC (Wed) by ewen (subscriber, #4772) [Link]

If copy tools just do read/write in a loop, the kernel is left guessing the high level intent (including read ahead and whether to cache it in the kernel buffers),

Ewen

The Linux "copy problem"
Posted May 30, 2019 13:22 UTC (Thu) by ecree (guest, #95790) [Link] (3 responses)

The Linux "copy problem"
Posted May 30, 2019 13:38 UTC (Thu) by desbma (guest, #118820) [Link] (2 responses)

> splice() moves data between two file descriptors [...] where one of the file descriptors must refer to a pipe

sendfile does not have this restriction (it can also work if the destination fd is a socket), so it is the ideal candidate for copying files.
Apparently, someone proposed to update coreutils' cp to use sendfile in 2012, but it was rejected: https://lists.gnu.org/archive/html/coreutils/2012-10/msg0...
A while ago I did some benchmarks in Python to compare "read/write chunk" vs "sendfile" based copy, and it led to a 30-50% speedup: https://github.com/desbma/pyfastcopy#performance

The Linux "copy problem"
Posted May 30, 2019 13:54 UTC (Thu) by ecree (guest, #95790) [Link] (1 responses)

Yes, so you make two splice calls:

    int p[2];
    pipe(p);
    splice(fd_in, NULL, p[1], NULL, len, flags);
    splice(p[0], NULL, fd_out, NULL, len, flags);

I haven't actually tried this, but in theory it should enable the kernel to do a zero-copy copy where the underlying files support that. The pipe is really no more than a way to associate a userspace handle with a kernel buffer; see https://yarchive.net/comp/linux/splice.html for details.

The Linux "copy problem"
Posted May 30, 2019 14:00 UTC (Thu) by desbma (guest, #118820) [Link]

This is exactly what sendfile does, with a single system call, instead of 3 for your example.

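For comparison, the sendfile()-based copy loop being discussed looks roughly like this (a sketch; since Linux 2.6.33 the destination of sendfile() can be a regular file rather than just a socket):

    /* Sketch: copy a file with sendfile(), one system call per chunk
     * and no userspace buffer. Error handling abbreviated. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>

    int copy_sendfile(int in, int out)
    {
        struct stat st;
        off_t off = 0;
        ssize_t n;

        if (fstat(in, &st) < 0)
            return -1;
        while (off < st.st_size) {
            n = sendfile(out, in, &off, st.st_size - off);
            if (n <= 0)
                return n < 0 ? -1 : 0;
        }
        return 0;
    }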
The Linux "copy problem"
Posted May 29, 2019 20:20 UTC (Wed) by roc (subscriber, #30627) [Link] (15 responses)

It seems like it would be worth at least diving into `cp` and making sure it's as good as can be, since that gets used so much. Then you can point developers of those other applications at `cp` as an example of how to do things right.

The Linux "copy problem"
Posted May 29, 2019 21:05 UTC (Wed) by smfrench (subscriber, #124116) [Link] (7 responses)

For example, some options which other tools like robocopy let the user select, and following up on other discussions at the summit:

- parallel i/o (especially for the uncached copy case)
- allow setting file size first (to reduce the number of metadata updates during the copy operation)
- allow calling the copy system call (copy_file_range API) for file systems which support it
- allow copying additional metadata (e.g. xattrs and ACLs)
- allow choosing larger i/o (overriding the block size). For some filesystems i/o > 1MB can be much faster than small I/O (some tools will default to 4K or smaller, which can be more than 10 times slower)
- allow options like encryption or compression (which could be supported over SMB3 for example and probably other filesystems)

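As an illustration of the "allow setting file size first" item above, a copy tool could preallocate the destination before writing any data; a minimal sketch, assuming a Linux-only tool (posix_fallocate() or ftruncate() would be the portable alternatives):

    /* Sketch: allocate the destination's blocks and extend its size up
     * front, so block allocation and size updates are not repeated for
     * every write during the copy. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    static int preallocate_dest(int out_fd, off_t src_size)
    {
        if (src_size == 0)
            return 0;
        return fallocate(out_fd, 0, 0, src_size);   /* mode 0: real allocation */
    }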
The Linux "copy problem"
Posted May 29, 2019 23:05 UTC (Wed) by roc (subscriber, #30627) [Link]

The Linux "copy problem"
Posted May 30, 2019 16:20 UTC (Thu) by boutcher (subscriber, #7730) [Link]

The Linux "copy problem"
Posted Jun 1, 2019 1:40 UTC (Sat) by tarkasteve (subscriber, #94934) [Link] (4 responses)

* Uses copy_file_range() where possible, falls back to userspace if not.
* Supports sparse files (with lseek; I wasn't aware of fiemap, is there any advantage to one over the other?)
* Partially parallel (recursive read is separate from copy operations; I have a todo for parallel copy as it seems to have advantages on nvme drives).
* Optional progress bar.
* Written in Rust
* Cross platform (well, Linux + other unix-like OSs; Windows may work, I've never managed to get Rust to work on it).

It doesn't support much in the way of permissions/ACLs ATM; it's still an intermittent WIP.
I did look at using O_DIRECT, but I get EINVAL. The open manpage lists a whole series of caveats and warnings about using it, including a disparaging quote from Linus.
Thanks for the discussion/article, it's given me some things to look into.

The Linux "copy problem"
Posted Jun 1, 2019 13:04 UTC (Sat) by desbma (guest, #118820) [Link] (1 responses)

It joins the list of great little tools that have taken inspiration from classic Unix command line tools, but rewritten them in Rust with many improvements along the way: grep -> ripgrep, find -> fd, hexdump -> hexyl, cat -> bat, du -> diskus, cloc -> tokei...
I'll be sure to look into xcp, and probably open a few issues along the way :)

The Linux "copy problem"
Posted Jun 2, 2019 3:02 UTC (Sun) by scientes (guest, #83068) [Link]

The Linux "copy problem"
Posted Jun 2, 2019 5:12 UTC (Sun) by tarkasteve (subscriber, #94934) [Link] (1 responses)

The Linux "copy problem"
Posted Jun 10, 2019 21:58 UTC (Mon) by smfrench (subscriber, #124116) [Link]

The Linux "copy problem"
Posted May 30, 2019 7:29 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

How people use those interfaces is something that needs guidance, always: you form a symbiotic loop in reliance on each other.
Conway's Law hasn't gone away, and its implications about interacting with the users of your interfaces still stand. I think that Chris Mason's comment here is misguided and that we need to help people work together and communicate better. I think we need to balance this view of APIs with criticism of xattrs being difficult to copy and racy (with security implications).
K3n.

The Linux "copy problem"
Posted May 30, 2019 16:00 UTC (Thu) by KAMiKAZOW (guest, #107958) [Link] (5 responses)

I find it insane that cp was fixed a decade ago by NASA and in all that time neither they nor anybody else thought about upstreaming the changes.

The Linux "copy problem"
Posted May 30, 2019 16:17 UTC (Thu) by desbma (guest, #118820) [Link] (4 responses)

For example zlib, one of the most widely used software libraries in the world, has several forks (Intel, Cloudflare, zlib-ng...) with optimizations that improve compression/decompression speed.
Yet the changes have never been merged back into zlib, and everybody still uses the historic version, and happily wastes CPU cycles (including when your browser decompresses this very page).

The Linux "copy problem"
Posted Jun 2, 2019 3:07 UTC (Sun) by scientes (guest, #83068) [Link] (3 responses)

Compression is disabled for https sites due to various attacks that exploit the information leaked by the compressed size.

The Linux "copy problem"
Posted Jun 2, 2019 10:27 UTC (Sun) by desbma (guest, #118820) [Link] (2 responses)

    curl -v --compressed 'https://lwn.net/' > /dev/null 2>&1 | grep gzip
    > Accept-Encoding: deflate, gzip
    < Content-Encoding: gzip

The Linux "copy problem"
Posted Jun 2, 2019 12:20 UTC (Sun) by Jandar (subscriber, #85683) [Link] (1 responses)

This command is obviously without any output.

    $ curl -v --compressed 'https://lwn.net/' > /dev/null 2>&1 | wc
    0 0 0

Perhaps you meant: curl -v --compressed 'https://lwn.net/' 2>&1 > /dev/null | grep gzip

The Linux "copy problem"
Posted Jun 2, 2019 12:45 UTC (Sun) by desbma (guest, #118820) [Link]

    curl -v --compressed 'https://lwn.net/' -o /dev/null 2>&1 | grep gzip

also works

The Linux "copy problem"
Posted May 30, 2019 10:35 UTC (Thu) by jezuch (subscriber, #52988) [Link] (6 responses)

:)

The Linux "copy problem"
Posted May 31, 2019 9:33 UTC (Fri) by LtWorf (subscriber, #124958) [Link] (2 responses)

The Linux "copy problem"
Posted May 31, 2019 14:44 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (1 responses)

The Linux "copy problem"
Posted May 31, 2019 15:46 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

The Linux "copy problem"
Posted Jun 16, 2019 17:40 UTC (Sun) by nevyn (guest, #33129) [Link] (2 responses)

The Linux "copy problem"
Posted Jun 20, 2019 6:51 UTC (Thu) by jezuch (subscriber, #52988) [Link] (1 responses)

That's how I remember it at least.
Anyway, when I copy something it's mostly because I screwed up and I am restoring things from a btrfs backup snapshot. Making a real copy would cause unnecessary duplication in the backup - and would be unnecessarily slow and drive-thrashing. But that's just my use case.

The Linux "copy problem"
Posted Jun 24, 2019 14:28 UTC (Mon) by nix (subscriber, #2304) [Link]

Ideally, a background preen phase would rewrite such things after a while so they are non-fragmented again. I guess one could call xfs_fsr from cron to do that... on an SSD this matters not at all, of course, and not very much if there's a caching layer like LVM's or bcache in the way either. But it's a real concern on unintermediated spinning rust.

The copy problem is really the backup problem
Posted May 30, 2019 15:01 UTC (Thu) by mcr (subscriber, #99374) [Link] (3 responses)

My claim is that our VFS layer is incomplete: it should include an atomic backup and an atomic restore operation, at least on a file level, but optionally on a directory basis. If we had that, then cp would always usefully be "backup file | restore file2". This means that filesystems would have to serialize file contents and metadata, and deserialize them too. Were Linux a microkernel architecture, then probably much of this deserialization could be done in some system-provided, non-ring-0 context. Whether we should pick tar for serialization, or something more modern like CBOR, is a bike shed for a design team.
I would just be happy if we could agree that we need this functionality.

The copy problem is really the backup problem
Posted Jun 4, 2019 9:18 UTC (Tue) by jezuch (subscriber, #52988) [Link] (2 responses)

    btrfs send
    btrfs receive
    btrfs subvolume snapshot -r

But it does not work on a per-file basis, unfortunately. And yes, btrfs defines its own serialization format.

The copy problem is really the backup problem
Posted Jun 4, 2019 14:06 UTC (Tue) by mcr (subscriber, #99374) [Link] (1 responses)

The copy problem is really the backup problem
Posted Jun 19, 2019 21:32 UTC (Wed) by nix (subscriber, #2304) [Link]

(You don't get -ETXTBSY if you read a file in any case, only if you try to modify it.)

The Linux "copy problem"
Posted May 30, 2019 18:56 UTC (Thu) by jmgao (guest, #104246) [Link]

Isn't there already a solution to this? You can open a file on the filesystem you want to create the file on with O_TMPFILE to hide it until you've done all of your attribute twiddling, do your fsetfilecon, fsetxattr, etc. on the file descriptor, and then use linkat to put it into place.

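A sketch of that sequence, with the paths and the example xattr being purely illustrative:

    /* Sketch: create an unnamed file with O_TMPFILE, set its attributes
     * while it is still invisible, then give it a name with linkat(). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    int create_with_attrs(const char *dir, const char *path)
    {
        char proc_path[64];
        int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);

        if (fd < 0)
            return -1;

        /* ... write the file data, then set ACLs, xattrs, security labels ... */
        fsetxattr(fd, "user.example", "value", 5, 0);

        /* Atomically link the fully prepared file into the namespace. */
        snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
        if (linkat(AT_FDCWD, proc_path, AT_FDCWD, path, AT_SYMLINK_FOLLOW) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }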
The Linux "copy problem"
Posted Jun 1, 2019 16:11 UTC (Sat) by jmclnx (guest, #72456) [Link] (1 responses)

In the early days I never saw this, and we used to make fun of Microsoft for having that freeze issue. I would rather have a slower cp and not have that sync freeze :)

The Linux "copy problem"
Posted Jun 1, 2019 21:13 UTC (Sat) by desbma (guest, #118820) [Link]

On my systems, I fix it permanently with:

    echo "vm.dirty_background_bytes=$((48 * 1024 * 1024))
    vm.dirty_ratio=10" | sudo tee /etc/sysctl.d/99-dirty-writeback.conf
    sudo sysctl --system

That way the writeback kicks in with at most 48MB of dirty data, and also stalls the writer process before there is more than 10% of all memory consumed by dirty writeback pages.

The Linux "copy problem"
Posted Feb 1, 2021 13:18 UTC (Mon) by oliwer (subscriber, #40989) [Link]

https://git.savannah.gnu.org/cgit/coreutils.git/commit/sr...
The Linux "copy problem"
It does not work for every case, but when it does, it is much more efficient than doing chunk based copy, and it solves the "what is the optimal I/O size" dilemma.
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
> splice() moves data between two file descriptors [...] where one of the file descriptors must refer to a pipe
The Linux "copy problem"
Yes, so you make two splice calls:
pipe(p);
splice(fd_in, NULL, p[1], NULL, len, flags);
splice(p[0], NULL, fd_out, NULL, len, flags);
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
- parallel i/o (especially for the uncached copy case)
- allow setting file size first (to reduce the number of metadata updates during the copy operation)
- allow calling the copy system call (copy_file_range API) for file systems which support it
- allow copying additional metadata (e..g xattr and ACLs)
- allow choosing larger i/o (overriding the block size). For some filesystems i/o > 1MB can be much faster than small I/O (some tools will default to 4K or smaller which can be more than 10 times slower)
- allow options like encryption or compression (which could be supported over SMB3 for example and probably other filesystems).
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
* Supports sparse files (with lseek; I wasn't aware of fiemap, is there any advantage to one over the other?)
* Partially parallel (recursive read is separate from copy operations; I have an todo for parallel copy as it seems to have advantages on nvme drives).
* Optional progress bar.
* Written in Rust
* Cross platform (well, Linux + other unix-like OSs; Windows may work, I've never managed to get Rust to work on it).
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
curl -v --compressed 'https://lwn.net/' > /dev/null 2>&1 | grep gzip
> Accept-Encoding: deflate, gzip
< Content-Encoding: gzip
The Linux "copy problem"
0 0 0
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
The copy problem is really the backup problem
My claim is that our VFS layer is incomplete: it should include an atomic backup and an atomic restore operation, at least on a file level, but optionally on a directory basis. If we had that, then cp would always usefully be backup file | restore file2. This means that file systems have to serialize file contents and meta data, and have to deserialize it too. We Linux a microkernel architecture, then probably much of this deserialization could be done in some system-provided, non-ring0 context. Should we pick tar for serialization, or something more modern like CBOR, that's a bike shed for a design team.
I would just be happy if we could agree that we need this functionality.
The copy problem is really the backup problem
btrfs send
btrfs receive
The copy problem is really the backup problem
The copy problem is really the backup problem
The Linux "copy problem"
The Linux "copy problem"
The Linux "copy problem"
echo "vm.dirty_background_bytes=$((48 * 1024 * 1024))
vm.dirty_ratio=10" | sudo tee /etc/sysctl.d/99-dirty-writeback.conf
sudo sysctl --system
The Linux "copy problem"
https://git.savannah.gnu.org/cgit/coreutils.git/commit/sr...