A look at rsync performance

A look at rsync performance

Posted Aug 19, 2010 9:15 UTC (Thu) by Liefting (subscriber, #8466)
In reply to: A look at rsync performance by ewen
Parent article: A look at rsync performance

For bulk copies like this, I tend to use tar over a shell pipe:

tar -cvf - * | (cd /someotherdir; tar -xvf -)

Or, when going through a network, tar in combination with nc:

receiver# nc -l 1234 | tar -xvf -
sender# tar -cvf - * | nc receiver 1234

I wonder how this stacks up against the other methods. At the very least it doesn't seem to be CPU-bound, but disk I/O bound or, in case of the network copy, network-bound.


A look at rsync performance

Posted Aug 19, 2010 9:55 UTC (Thu) by spaetz (subscriber, #32870) [Link]

> tar -cvf - * | (cd /someotherdir; tar -xvf -)

At the risk of going off-topic: why would you do something like that rather than a simple "cp"? I would honestly be interested in learning why tar is better in that case.

A look at rsync performance

Posted Aug 19, 2010 10:35 UTC (Thu) by dafid_b (guest, #67424) [Link]

When done as root, the tar pipe preserves ownership details etc.

I don't think that is part of cp.

A look at rsync performance

Posted Aug 19, 2010 10:48 UTC (Thu) by spaetz (subscriber, #32870) [Link]

I commonly use "cp -a", which includes --preserve=all. The attributes that --preserve can keep are: mode, ownership, timestamps, context, links, xattr, all

but perhaps that is missing out things that tar manages to preserve. Not sure. Just curious in any case.

A look at rsync performance

Posted Aug 19, 2010 15:08 UTC (Thu) by bronson (subscriber, #4806) [Link]

It's a graybeard thing. 20 years ago, cp would screw up permissions, dates, ownership, symlinks, device files, etc. Different platforms would require different command-line options and then screw up different things. It was insane. Tar, on the other hand, pretty much got it right on every platform.

Nowadays cp -a works well everywhere (in my experience) so there's no need to resort to tar. It's just damage from the Unix wars.

A look at rsync performance

Posted Aug 19, 2010 19:31 UTC (Thu) by pj (subscriber, #4506) [Link]

One advantage is that it's easily modified to work over ssh:

tar cf - . | ssh user@remote "cd /dest/dir; tar xf -"

or

(ssh user@remote "cd /src/dir ; tar cf - . ") | (cd /dest/dir; tar xf -)

A look at rsync performance

Posted Aug 20, 2010 11:53 UTC (Fri) by NAR (subscriber, #1313) [Link]

A 'cp' command can also be easily modified to work over the network, just add an 's' to the front of 'cp' :-)

A look at rsync performance

Posted Aug 20, 2010 12:28 UTC (Fri) by dsommers (subscriber, #55274) [Link]

True ... but if you have a lot of files, especially smaller files, the tar path with ssh is way faster than scp. Try copying a git repository (~2-3MB) from one site to another site. My experience is that tar+ssh beats scp significantly.

A look at rsync performance

Posted Aug 20, 2010 14:52 UTC (Fri) by spaetz (subscriber, #32870) [Link]

> but if you have a lot of files, especially smaller files, the tar path with ssh is way faster than scp. Try copying a git repository (~2-3MB) from one site to another site. My experience is that tar+ssh beats scp significantly.

That is only because, by default, you open a new ssh connection per file, which causes lots of overhead, while tar+ssh opens only one. If you reuse your ssh connection, scp will be fast as well:
http://www.debian-administration.org/articles/290

A look at rsync performance

Posted Aug 21, 2010 2:05 UTC (Sat) by dmag (subscriber, #17775) [Link]

No, scp can't copy symlinks.

A look at rsync performance

Posted Aug 24, 2010 20:01 UTC (Tue) by BackSeat (subscriber, #1886) [Link]

No need for all the "cd" commands:
tar -C /src/dir -cf - . | tar -C /dest/dir -xf -
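A minimal sketch of that -C form, run against throwaway directories so it is safe to try. The directory and file names are made up for illustration, and -p is added on the extracting side (an assumption, not part of the command above) so that mode bits are kept regardless of the umask.

```shell
#!/bin/sh
# Scratch directories stand in for /src/dir and /dest/dir (hypothetical names).
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
src="$work/src"
dst="$work/dest"
mkdir -p "$src/sub" "$dst"

# Sample content: a file with restrictive permissions and a relative symlink.
echo "hello" > "$src/sub/data.txt"
chmod 640 "$src/sub/data.txt"
ln -s sub/data.txt "$src/link"

# -C makes tar change directory itself, so no cd or subshell is needed.
tar -C "$src" -cf - . | tar -C "$dst" -xpf -

# Show what arrived, including the symlink and the 640 mode.
ls -lR "$dst"
```

Because the archive is created from "." inside the source directory, member names are relative and land directly under the destination.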

A look at rsync performance

Posted Aug 19, 2010 21:08 UTC (Thu) by evgeny (guest, #774) [Link]

There is one thing tar and cp -a do differently, which, depending on what you do could be either a feature or a misfeature. tar tries to restore files according to _literal_ names of the owner (if they exist; and they do by default). This can be overridden with the --numeric-owner flag. E.g. if you forget to specify this flag and untar a backup of a virtual container from the host, you'll end up with a mess of file ownerships. Was bitten by this once...
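To see the difference without touching real data, here is a small sketch that only lists an archive both ways. It assumes GNU tar (the listing format differs in other implementations); the scratch paths and the restore path in the final comment are hypothetical.

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/tree"
echo "x" > "$work/tree/file"

# Build a small archive; tar records both the owner names and the numeric IDs.
tar -C "$work/tree" -cf "$work/backup.tar" .

# Default listing: shows the owner and group names recorded in the archive.
by_name=$(tar -tvf "$work/backup.tar")

# With --numeric-owner the stored uid/gid numbers are shown instead.
by_id=$(tar --numeric-owner -tvf "$work/backup.tar")

printf '%s\n\n%s\n' "$by_name" "$by_id"

# When restoring a container backup as root, the same flag goes on the
# extracting side, for example (hypothetical path):
#   tar --numeric-owner -C /restore/target -xpf backup.tar
```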

A look at rsync performance

Posted Aug 25, 2010 3:09 UTC (Wed) by roelofs (guest, #2599) [Link]

> Nowadays cp -a works well everywhere (in my experience) so there's no need to resort to tar.

BSDs included? In my experience they've been mighty picky about the GNUisms (or "things that would have been GNUisms if someone else hadn't done them first") they're willing to implement. I remember being surprised by something along those lines just a couple of months ago, though I've forgotten the details already.

But perhaps cp -a came from BSD in the first place...

Greg

+1 Informative

Posted Aug 25, 2010 13:25 UTC (Wed) by dmarti (subscriber, #11625) [Link]

Just ssh-ed in to a FreeBSD 7.2 system -- `cp -a` works, and `-a` is in the man page.

A look at rsync performance

Posted Aug 19, 2010 10:50 UTC (Thu) by valhalla (subscriber, #56634) [Link]

cp -p preserves mode, ownership and timestamps, and the --preserve option can be used to do a finer selection of what should be preserved.

A look at rsync performance

Posted Aug 19, 2010 10:56 UTC (Thu) by dafid_b (guest, #67424) [Link]

OK - I read the fine cp manual page and now think that preserving ownership details etc. is part of cp. However, I am still confused, as many of the notes in the manual page refer to topics I know nothing about.

I would use the tar pipe of old as I expect it to build a proper copy in the new location.

Having read the cp manual page again (and again) I fear my confidence in tar might be misplaced :(.

Anyone know of a tutorial for each of the cp options?

A look at rsync performance

Posted Aug 21, 2010 6:16 UTC (Sat) by dirtyepic (subscriber, #30178) [Link]

for me, --exclude.
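A short sketch of what that looks like in a tar pipe, using scratch directories. The file names and the two patterns are invented for the example; the --exclude options are placed before the "." operand, since some tar versions complain when they come after it.

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/cache" "$work/dest"
echo "keep" > "$work/src/notes.txt"
echo "skip" > "$work/src/cache/blob.dat"
echo "skip" > "$work/src/scratch.tmp"

# --exclude takes a glob pattern and may be given more than once.
tar -C "$work/src" --exclude='cache' --exclude='*.tmp' -cf - . \
    | tar -C "$work/dest" -xf -

# Only notes.txt should have been copied.
find "$work/dest" | sort
```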

A look at rsync performance

Posted Aug 19, 2010 10:53 UTC (Thu) by ewen (subscriber, #4772) [Link]

If you have a directory with, say, 100 * 2GB files in it, and another directory which has 96 of those files, and a few older ones, then using your tar pipeline requires transferring 200GB of data -- but using rsync only requires transferring 8GB of data. I know which I'd prefer. (And the tar technique still leaves you having to figure out which files no longer belong and remove them.)
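The arithmetic can be sketched with empty stand-in files. This is only an illustration: it compares file names, whereas rsync by default also looks at size and modification time, and the file names and the count of two stale files are made up.

```shell
#!/bin/sh
LC_ALL=C
export LC_ALL
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/src" "$work/dest"

# 100 empty stand-ins for the 2GB files; the destination already has 96 of them.
i=1
while [ "$i" -le 100 ]; do
    : > "$work/src/file$i"
    if [ "$i" -le 96 ]; then
        : > "$work/dest/file$i"
    fi
    i=$((i + 1))
done
# A few older files that no longer exist in the source.
: > "$work/dest/old1"
: > "$work/dest/old2"

ls "$work/src" | sort > "$work/src.list"
ls "$work/dest" | sort > "$work/dest.list"

# Names only in the source: what still has to be transferred.
to_copy=$(comm -23 "$work/src.list" "$work/dest.list" | wc -l | tr -d ' ')
# Names only in the destination: what a tar pipe would leave behind.
stale=$(comm -13 "$work/src.list" "$work/dest.list" | wc -l | tr -d ' ')

echo "transfer $to_copy files (~$((to_copy * 2))GB rather than $((100 * 2))GB); $stale stale files to remove"
```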

There are a bunch of reasons for using rsync as shorthand for "make these two directories the same", even without needing the rsync algorithm to synchronise changes within an individual file. And it seems to me that adding a special case for "whole new file" into the rsync program, that copied with maximum efficiency, would be valuable. Which I think was (one of) the points of the original article.

Ewen

PS: I use "tar -cpf - . | (cd /dest && tar -xpf -)" for a bunch of safety reasons, and to preserve at least some permissions. With GNU tar that'll copy most things; with traditional unix tar, less so, but it gets closer than most tools on traditional unix. (GNU cp has an "-a" extension which will also preserve most things.)
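The comment does not spell out the safety reasons, but one likely candidate is what happens when the cd fails. The sketch below, using scratch directories and a deliberately missing destination (all names hypothetical), contrasts "&&" with ";".

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/src" "$work/cwd"
echo "payload" > "$work/src/file.txt"

cd "$work/cwd" || exit 1
missing="$work/no-such-dir"   # deliberately does not exist

# With &&, a failed cd stops the subshell before tar runs,
# so nothing is unpacked into the current directory by accident.
tar -C "$work/src" -cpf - . | (cd "$missing" 2>/dev/null && tar -xpf -) || true
after_and=$(ls -A | wc -l | tr -d ' ')

# With ;, tar runs regardless and unpacks into wherever we happen to be.
tar -C "$work/src" -cpf - . | (cd "$missing" 2>/dev/null; tar -xpf -) || true
after_semicolon=$(ls -A | wc -l | tr -d ' ')

echo "entries in cwd after '&&': $after_and, after ';': $after_semicolon"
```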

PPS: For the later questioner, using a tar pipeline historically had better performance because it scheduled two processes which kept more I/O in flight. I've not looked recently to see if that's still the case, and given the performance numbers in the article it may not be the case (eg, the kernel's readahead may do just as well, if not better).

A look at rsync performance

Posted Aug 19, 2010 22:53 UTC (Thu) by Comet (subscriber, #11646) [Link]

-W, --whole-file copy files whole (without rsync algorithm)

A look at rsync performance

Posted Aug 20, 2010 4:29 UTC (Fri) by jcvw (subscriber, #50475) [Link]

I tried that. It doesn't help. The amount of user and system time is still incredibly high (compared to a simple cp). rsync doesn't use a tight read/write loop, like cp does, but (even in local cases) uses two processes, a socket and lots of select system calls. The -W doesn't change anything there (unfortunately).

A look at rsync performance

Posted Aug 20, 2010 4:46 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

my first reaction on reading this is that the processes are stalling, and the behavior you describe to improve performance sounds like it's on the same tack.

did you try tweaking the buffer sizes that rsync uses to see if larger buffers may smooth things out a bit?

I would still expect it to take significantly more CPU time than a plain cp, I'm not bothered by that (it's really designed for a different job that it excels at, this is a degenerate corner case for it), but the throughput should be higher.

while the approach is less efficient than the read()/write() in a tight loop, if it can keep the buffers between the processes full it should be able to do fairly well (as your testing shows when you tweak things so that the threads aren't waiting for each other much), so perhaps larger buffers can avoid the stalls.

A look at rsync performance

Posted Aug 19, 2010 11:55 UTC (Thu) by zmower (subscriber, #3005) [Link]

Using netcat looks tricky (and less secure) compared to:
sender$ tar cf - * | ssh user@receiver tar -C $dir -xvf -

A look at rsync performance

Posted Aug 23, 2010 23:49 UTC (Mon) by bronson (subscriber, #4806) [Link]

Yes, but it has WAY less overhead. I know there's a way to tell ssh to use a less CPU-heavy cipher but I always forget how.

A look at rsync performance

Posted Aug 20, 2010 12:01 UTC (Fri) by rvfh (subscriber, #31018) [Link]

Would this handle Ctrl-C better than cp? It's always a problem if you launch a 'cp -au' because the mtime of the file is only set once the copy is finished (for obvious reasons), so interrupting the copy leaves you with a broken file that's newer than anything else, and thus you cannot recover unless you find out which file it is (using find is an option, but can be slow over a big number of files).
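For the find approach, one way to narrow the search is to drop a marker file just before starting the copy and look for anything newer than it afterwards. This is a sketch with made-up file names; the interrupted copy is simulated by writing a truncated file, and the one-second sleep is only there so the timestamps differ on coarse-grained filesystems.

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/src" "$work/dest"
echo "one" > "$work/src/a"
echo "two" > "$work/src/b"
touch -t 200001010000 "$work/src/a" "$work/src/b"   # old source mtimes

# Marker created just before the copy starts.
touch "$work/marker"
sleep 1

cp -p "$work/src/a" "$work/dest/a"   # finished copy: old mtime restored
printf 'tw' > "$work/dest/b"         # simulated interrupted copy: mtime is "now"

# Any destination file newer than the marker was still being written.
suspect=$(find "$work/dest" -type f -newer "$work/marker")
echo "re-copy: $suspect"
```

This still walks the destination tree, so it does not avoid the cost of find over a large number of files; it only makes the broken file easy to pick out.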

(sorry for going OT)

A look at rsync performance

Posted Aug 23, 2010 10:39 UTC (Mon) by error27 (subscriber, #8346) [Link]

That should be "nc -q 0 receiver 1234". I wouldn't trust netcat for data I cared about...

A look at rsync performance

Posted Aug 26, 2010 14:44 UTC (Thu) by ariveira (guest, #57833) [Link]

> For bulk copies like this, I tend to use tar over a shell pipe:
> tar -cvf - * | (cd /someotherdir; tar -xvf -)

Recently pax (the POSIX archiver) was brought to my attention:

pax -rw . /someotherdir

Not to mention its awesome -s option.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds