Development

A look at rsync performance

August 18, 2010

This article was contributed by JC van Winkel

The problem

Recently I bought a shiny new disk for my Fedora 10-based MythTV system. I had to copy some 700GiB of video files from the old disk to the new one. I am used to rsync for this type of job, as the rsync command and its accompanying options flow right from my fingers to the keyboard. However, I was not happy with what I saw: the files were copied at only about 37MiB/s. Both disks can handle about three times that speed, at least on the outer cylinders. That makes a lot of difference: an expected wait of just over two hours turned into a six-hour ordeal. Note that both SATA disks were local to the system and no network was involved.

Measuring

To find out what was going on, I created a small test: copying a 10GiB file from one disk to the other. I made sure that the ext4 file systems involved were completely fresh so fragmentation could not play a part (a new mkfs after each test). I also made sure that the test file systems were created on the outermost (and fastest) cylinders of the disks. Simply reading the source file could be done at 106MiB/s and writing a 10GiB file to the destination file system could be done at 134MiB/s.

The copy programs under test were rsync, cpio, cp, and cat. Of course I took care that the cache could not interfere, by flushing the cache before each test and by waiting for the dirty buffers to be flushed to the destination disk after the test command completed. For example, when SRC and DEST are variables holding the name of the source file in the current directory and the name of the destination directory:

    sync                               # flush dirty buffers to disk
    echo 3 > /proc/sys/vm/drop_caches  # discard caches
    time sh -c "cp $SRC $DEST; sync"   # measure cp and sync time

The echo command to /proc/sys/vm/drop_caches forces the invalidation of all non-dirty buffers in the page cache. To also force dirty pages to be flushed, we first use the sync command. The copy command will copy the 10GiB file, but it will actually finish before the last blocks have been flushed to disk. That is why we time the combination of the cp command and the sync command, which forces flushing the dirty blocks to disk.
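The same protocol can be wrapped in a small helper so that every copy program is measured the same way. This is only a sketch: the function name `bench_copy` is made up for illustration, it assumes the command takes the source and destination as its last two arguments (true for cp and rsync, not for cat or cpio), and it reports whole seconds only.

```shell
# Hypothetical helper: run a copy command with cold caches and time it,
# including the final sync. Usage: bench_copy SRC DEST COMMAND [ARGS...]
bench_copy() {
    src=$1
    dest=$2
    shift 2

    sync    # flush dirty buffers to disk
    # Discard clean caches; this needs root, so only warn when it fails.
    if ! { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null; then
        echo "warning: could not drop caches (not root?)" >&2
    fi

    start=$(date +%s)
    "$@" "$src" "$dest" && sync    # the copy is not done until sync returns
    status=$?
    end=$(date +%s)

    echo "elapsed: $((end - start)) s"
    return "$status"
}
```

With this in place, `bench_copy "$SRC" "$DEST" cp` and `bench_copy "$SRC" "$DEST" rsync` would run under identical conditions.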

The four commands tested were:

    rsync $SRC $DEST
    echo $SRC | cpio -p $DEST
    cp  $SRC $DEST
    cat $SRC > $DEST/$SRC

The results for rsync, cpio, cp, and cat were:

    user (s)   sys (s)   elapsed (s)   hog   MiB/s    test
    5.24       77.92     101.86        81%   100.53   cpio
    0.85       53.77     101.12        54%   101.27   cp
    1.73       59.47     100.84        60%   101.55   cat
    139.69     93.50     280.40        83%   36.52    rsync

The observation that rsync was slow was indeed substantiated. Looking at the hog factor (the amount of CPU time used relative to the elapsed time), we can conclude that rsync is not disk-bound, as one would expect of a copy program, but CPU-bound. That required some more scrutiny. The atop program showed that rsync uses three processes: one that does only disk reads, one that does only disk writes, and one (I assume) control process that uses little CPU time and does no disk I/O.
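The hog factor and the throughput column can be recomputed from the raw timings. The sketch below uses two made-up helper names, `hog` and `mib_per_s`, and assumes the percentages in the table were truncated rather than rounded, since that is what reproduces the published figures.

```shell
# hog USER SYS ELAPSED: CPU time as a whole percentage of elapsed time.
# The result is truncated, which matches the figures in the results table.
hog() {
    awk -v u="$1" -v s="$2" -v e="$3" \
        'BEGIN { printf "%d%%\n", int(100 * (u + s) / e) }'
}

# mib_per_s SIZE_MIB ELAPSED: throughput in MiB/s with two decimals.
mib_per_s() {
    awk -v m="$1" -v e="$2" 'BEGIN { printf "%.2f\n", m / e }'
}

hog 139.69 93.50 280.40     # rsync row, prints 83%
mib_per_s 10240 280.40      # 10GiB = 10240MiB, prints 36.52
```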

Using strace, it can be shown that cp only uses read() and write() system calls in a tight loop, while rsync uses two processes that talk to each other using reads and writes through a socket, sprinkled with loads of select() system calls. To simulate the multiple processes, I then used multiple cat processes strung together using pipes. That test does not show the bad performance that rsync demonstrates. To test the influence of using a socket, I also created a TCP service using xinetd that just starts cat with its output redirected to a file to simulate the "network traffic." The client side:

    cat $SRC | nc localhost myservice

And the server side:

    cat > $DEST

Even this setup outperforms rsync. It achieves the same disk bandwidth as cp with a far lower CPU load than rsync.
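The exact command used for the pipe test is not given in the text. A minimal sketch, assuming three cat processes to mirror the three rsync processes (the function name `pipe_copy` is hypothetical), might look like this:

```shell
# Copy SRC to DEST through three cat processes connected by pipes.
# Each stage does nothing but read its input and write it out again.
pipe_copy() {
    cat "$1" | cat | cat > "$2"
}
```

It would be timed in the same way as the other commands, for example with `pipe_copy "$SRC" "$DEST/$SRC"; sync`.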

The kernel plays a role too

On my 4-core AMD Athlon II X4 620 system, all three processes seem to run on the same CPU most of the time. With help from the taskset command, however, it is possible to force processes onto specific sets of processors (cores). Suppose the three rsync processes have PIDs 1111, 1112, and 1113; each can then be forced onto its own core with:

    taskset -pc 0 1111   # force on CPU0
    taskset -pc 1 1112   # force on CPU1
    taskset -pc 2 1113   # force on CPU2

By using taskset right after rsync was started, the throughput of rsync went up from 36.5MiB/s to 40MiB/s. Though that is a 10% improvement, it was still nowhere near cat's performance. When the three rsync processes were forced to run on the same CPU, performance went down to 32MiB/s.
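Typing the PIDs by hand is a race against a running copy. A small loop can assign the cores instead. The sketch below is an illustration only: `pin_each_to_own_core` and the `DRY_RUN` variable are invented names, and the usage line assumes that pgrep is available.

```shell
# Pin each PID given as an argument to its own core: the first PID goes to
# core 0, the second to core 1, and so on.
# With DRY_RUN=1 the taskset commands are only printed, not executed.
pin_each_to_own_core() {
    core=0
    for pid in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "taskset -pc $core $pid"
        else
            taskset -pc "$core" "$pid"
        fi
        core=$((core + 1))
    done
}

# Usage, once rsync is running:
#   pin_each_to_own_core $(pgrep -x rsync)
```

For the single-core case no PIDs are needed at all: starting the copy as `taskset -c 0 rsync $SRC $DEST` should keep all three processes on CPU0, because child processes inherit the CPU affinity of their parent.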

rsync needs quite a lot of CPU power (both user and system time). Despite that, the ondemand frequency governor does not scale up the CPU frequency. We can force all cores to run at the highest frequency with:

    for i in 0 1 2 3 ; do
      echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
    done

If the CPU frequency is forced to the highest setting (2.6GHz), the result for three rsyncs on a single core goes up to 62MiB/s. Combining this with the "spread the load" tactic using taskset, we even get up to 85MiB/s. That is still 15% less than the other copy programs, but more than a twofold performance increase compared to the default situation.
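The governor loop hard-codes four cores. A variant that covers whatever cores the system has could be sketched as follows; the function name `set_governor` and its optional second argument (a directory to use in place of the real sysfs tree, handy for trying it out without root) are assumptions made for this example.

```shell
# set_governor GOVERNOR [CPU_ROOT]
# Write GOVERNOR to every core's scaling_governor file under CPU_ROOT
# (default: /sys/devices/system/cpu). Writing to the real sysfs needs root.
set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue    # glob matched nothing, or no cpufreq support
        echo "$gov" > "$f"
    done
}
```

Running `set_governor performance` before a test and `set_governor ondemand` afterwards would restore the default behavior once the measurement is done.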

The conclusion is that, in the default situation, using cp instead of rsync will give you almost threefold better performance. A little tinkering with the scheduler (using taskset) and the cpufreq governor can get you a twofold performance improvement with rsync, but that still leaves it about 15% slower than cp.

Summarizing the results of the test with rsync:

    Throughput   CPUs   Core frequency
    22MiB/s      1-3    0.8GHz
    23MiB/s      1      0.8GHz
    34MiB/s      1      ondemand
    37MiB/s      1-3    ondemand    << default
    39MiB/s      3      0.8GHz
    40MiB/s      3      ondemand
    62MiB/s      1      2.6GHz
    62MiB/s      1-3    2.6GHz
    85MiB/s      3      2.6GHz

In this table, the second column shows how the rsyncs were distributed over the cores. 1 CPU means the three rsyncs were forced on the same one single CPU. 1-3 CPUs means the scheduler could do what it saw fit. And finally when the three rsyncs were each forced on their own CPU, the table shows 3 CPUs.

It is clear that the default settings are not the worst settings, but they are close to it.

The future

An LWN article described problems that the ondemand governor has in choosing the right CPU frequency for processes that do a lot of I/O and need a lot of CPU power (like rsync). While the processor is waiting for the I/O to finish, the clock frequency is scaled down almost immediately. But when the disk request finishes and the process continues using the CPU, the ondemand governor waits too long before scaling the frequency back up again. Arjan van de Ven of the Intel Open Source Technology Centre has made changes to the ondemand governor so that it won't scale the CPU down until the CPU is really idle, and not just waiting for fast I/O.

The bad behavior can be seen using cpufreq_stats. After loading the module:

    modprobe cpufreq_stats

it is possible to see how much time was spent in each frequency by which core. If we look at the results after the rsync command, we see for CPU 2:

    $ cat /sys/devices/system/cpu/cpu2/cpufreq/stats/time_in_state
    2600000 423293
    1900000 363
    1400000 534
    800000 6645805

The frequency (in 1000Hz units) is the first column, while the time (in 10ms units) is the second column. Since the module was loaded, CPU2 has spent most time on the lowest frequency, despite the fact that rsync really is quite CPU-intensive.
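The raw counters are easier to judge as percentages. The following sketch (the function name `freq_share` is invented here) converts the two columns of time_in_state into MHz and a share of the total time:

```shell
# freq_share [FILE...]: print each frequency in MHz with its share of the
# total time. Input lines have the time_in_state format:
#   <frequency in kHz> <time in 10ms units>
freq_share() {
    awk '{ freq[NR] = $1; ticks[NR] = $2; total += $2 }
         END {
             if (total == 0) exit    # nothing recorded yet
             for (i = 1; i <= NR; i++)
                 printf "%d MHz %.1f%%\n", freq[i] / 1000, 100 * ticks[i] / total
         }' "$@"
}

# Usage:
#   freq_share /sys/devices/system/cpu/cpu2/cpufreq/stats/time_in_state
```

Applied to the numbers shown for CPU 2, this reports that about 94% of the time was spent at 800MHz and about 6% at 2.6GHz.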

After all these results, I decided to give Arjan's patches a try. I compiled kernel version 2.6.35-rc3, which has the patches incorporated, and used that instead of the 2.6.27.41-170.2.117 kernel that Fedora 10 was running when the original problem popped up. For comparison, I also ran the tests with a more recent kernel that does not incorporate Arjan's patches: 2.6.34.

I could immediately see (in atop) that the three rsync processes were on separate processors most of the time. The newer kernels apparently are better at spreading the load. However, this is not a great help:

    FC10      2.6.34    2.6.35-rc3   CPUs   Frequency
    (MiB/s)   (MiB/s)   (MiB/s)
    23.12     28.85     28.07        1      0.8GHz
    22.19     44.23     45.25        1-3    0.8GHz
    38.62     43.39     43.75        3      0.8GHz
    34.01     55.48     57.37        1      ondemand
    36.52     44.85     45.08        1-3    ondemand    << default
    39.73     43.65     44.30        3      ondemand
    62.37     66.67     68.52        1      2.6GHz
    62.15     92.34     91.84        1-3    2.6GHz
    85.47     89.79     89.42        3      2.6GHz
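The placement of the rsync processes on the cores was observed in atop. A rough way to check the same thing from a script is to read the "processor" field of /proc/PID/stat, which holds the CPU a process last ran on. The helper name `cpu_of` is made up for this sketch, and the usage line assumes that pgrep is available.

```shell
# cpu_of PID: print the number of the CPU that PID last ran on.
# Field 39 of /proc/PID/stat is "processor". The command name in field 2 may
# contain spaces, so everything up to the closing parenthesis is stripped
# first, which shifts the wanted field to position 37.
cpu_of() {
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $37 }'
}

# Usage, while rsync is running:
#   for pid in $(pgrep -x rsync); do echo "$pid on CPU $(cpu_of "$pid")"; done
```

Sampling this a few times during a copy gives an impression of how the scheduler spreads the three processes.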

Conclusions

One thing is clear: I should upgrade the kernel on my MythTV system. In general, the 2.6.34 and 2.6.35-rc3 kernels give better performance than the old 2.6.27 kernel. But, tinkering or not, rsync still cannot beat a simple cp that copies at over 100MiB/s. Indeed, rsync really needs a lot of CPU power for simple local copies. At the highest frequency, cp only needed 0.34+20.95 seconds of CPU time, compared with rsync's 70+55 seconds.

The newer kernels are better at spreading the processes over the cores. However, this is hindering Arjan van de Ven's patch from doing its work. The patch does indeed work when all rsync processes run on a single CPU. But because the new kernel does a better job of spreading the processes over CPUs, Arjan's frequency increase does not occur. Arjan is working on an entirely new governor that may be better at raising the CPU's frequency when doing a lot of disk I/O.

Brief items

Quote of the week

Combining both of their work together, they have been able to make a 20 minute long voice call from a baseband processor running a Free Software GSM stack. For all we know, it is the first time anything remotely like this has been done using community-developed Free Software. Five years ago I would have thought it's impossible to pull this off with a small team of volunteers. I'm very happy to see that I was wrong, and we actually could do it. With less than half a dozen of developers, in less than nine months of unpaid, spare-time work.
-- Harald Welte

Anjuta DevStudio 2.31.90 released

The Anjuta DevStudio 2.31.90 release is out. The big news appears to be full support for developing in Python, and initial support for plugins written in Python.

MPRIS v2.0 released

Version 2.0 of the Media Player Remote Interfacing Specification (MPRIS) has been released; it is a D-Bus-based interface for controlling media players. "This version is an almost complete rewrite of the specification and enumerating the changes would be long and boring."

Ruby 1.9.2 released

The Ruby 1.9.2 release is out. Changes include a new socket API with IPv6 support, a new random class, a reimplemented time module with no 2038 problem, and "many new methods." See the NEWS file for more information.

Newsletters and articles

Development newsletters from the last week

Algorithmic Music Composition With Linux - athenaCL (Linux Journal)

Dave Phillips takes a look at athenaCL in the conclusion to his survey of algorithmic music composition systems for Linux. "In many ways athenaCL is the most comprehensive system that I've used for algorithmic composition. Its feature set is rich in familiar and unusual resources, and its reliance on Python eases the way into the system through a powerful general-purpose programming language."

Free and Open-Source Software—An Analog Devices Perspective

Analog Devices Inc. (ADI), maker of the Blackfin CPU and a wide variety of input and output devices targeted at embedded system designers, has put together a white paper for its customers to help them understand free and open source software (FOSS). While it uses examples specific to ADI devices, it would be useful to many embedded developers who are trying to wrap their heads around FOSS. "The popularity of FOSS in the embedded markets is dominated by simple economic motivation—it lowers software costs and hastens time to market. It turns "roll-your-own" developers into system-level integrators who can focus on adding value and differentiating features of their products rather than reproducing the same base infrastructure over and over again. It is the only proven methodology to reel in out-of-control software development costs."

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds