A look at rsync performance
The problem
Recently I bought a shiny new disk for my Fedora 10-based MythTV system. I had to copy some 700GiB of video files from the old disk to the new one. I am used to rsync for this type of job, as the rsync command and its accompanying options flow right from my fingers to the keyboard. However, I was not happy with what I saw, as the performance was nothing to write home about: the files were copied at about 37MiB/s. Both disks can handle about three times that speed, at least on the outer cylinders. That makes a lot of difference: an expected wait of just over two hours changed into a six-hour ordeal. Note that both SATA disks were local to the system and no network was involved.
Measuring
To find out what was going on, I created a small test: copying a 10GiB file from one disk to the other. I made sure that the ext4 file systems involved were completely fresh so fragmentation could not play a part (a new mkfs after each test). I also made sure that the test file systems were created on the outermost (and fastest) cylinders of the disks. Simply reading the source file could be done at 106MiB/s and writing a 10GiB file to the destination file system could be done at 134MiB/s.
The copy programs under test were rsync, cpio, cp, and cat. Of course, I made sure the cache could not interfere by flushing it before each test and by waiting for the dirty buffers to be flushed to the destination disk after the test command completed. For example, when SRC and DEST are shell variables holding the name of the source file in the current directory and the name of the destination directory:
sync # flush dirty buffers to disk
echo 3 > /proc/sys/vm/drop_caches # discard caches
time sh -c "cp $SRC $DEST; sync" # measure cp and sync time
The echo command to /proc/sys/vm/drop_caches forces the invalidation of all non-dirty buffers in the page cache. To also force dirty pages to be flushed, we first use the sync command. The copy command will copy the 10GiB file, but it will actually finish before the last blocks have been flushed to disk. That is why we time the combination of the cp command and the sync command, which forces flushing the dirty blocks to disk.
The four commands tested were:
rsync $SRC $DEST
echo $SRC | cpio -p $DEST
cp $SRC $DEST
cat $SRC > $DEST/$SRC
The results for rsync, cpio, cp, and cat were:
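| Test | user (s) | sys (s) | elapsed (s) | hog | MiB/s |
|---|---|---|---|---|---|
| cpio | 5.24 | 77.92 | 101.86 | 81% | 100.53 |
| cp | 0.85 | 53.77 | 101.12 | 54% | 101.27 |
| cat | 1.73 | 59.47 | 100.84 | 60% | 101.55 |
| rsync | 139.69 | 93.50 | 280.40 | 83% | 36.52 |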
The observation that rsync was slow was indeed substantiated. Looking at the hog factor (the amount of CPU time used relative to the elapsed time), we can conclude that rsync is not so much disk-bound (as would be expected) as CPU-bound. That required some more scrutiny. The atop program showed that rsync uses three processes: one that does only disk reads, one that does only disk writes, and one (I assume) control process that uses little CPU time and does no disk I/O.
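As a rough illustration of how the hog and MiB/s columns relate to the timings, here is a minimal sketch. The function names hog and mibps are made up for this example, and the 10240MiB size assumes the 10GiB test file. Truncating to a whole percent appears to reproduce the hog figures reported above.

```shell
#!/bin/sh
# hog USER SYS ELAPSED
# CPU time (user + sys) as a percentage of elapsed time, truncated to a
# whole percent. "hog" is a hypothetical helper name, not an existing tool.
hog() {
    awk -v u="$1" -v s="$2" -v e="$3" 'BEGIN { printf "%d%%\n", 100 * (u + s) / e }'
}

# mibps SIZE_MIB ELAPSED
# Throughput in MiB/s with two decimals (also a hypothetical helper).
mibps() {
    awk -v n="$1" -v e="$2" 'BEGIN { printf "%.2f\n", n / e }'
}

hog 139.69 93.50 280.40    # rsync timings; prints 83%
mibps 10240 280.40         # 10GiB = 10240MiB; prints 36.52
```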
Using strace, it can be shown that cp only uses read() and write() system calls in a tight loop, while rsync uses two processes that talk to each other using reads and writes through a socket, sprinkled with loads of select() system calls. To simulate the multiple processes, I then used multiple cat processes strung together using pipes. That test does not show the bad performance that rsync demonstrates. To test the influence of using a socket, I also created a TCP service using xinetd that just starts cat with its output redirected to a file to simulate the "network traffic." The client side:
cat $SRC | nc localhost myservice
And the server side:
cat > $DEST
Even this setup outperforms rsync. It achieves the same disk bandwidth as cp with a far lower CPU load than rsync.
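For reference, the earlier pipe test (several cat processes strung together) could be sketched roughly as follows. The exact command line used is not given above, so the function name pipe_copy and the choice of three cat processes (to mirror rsync's three processes) are assumptions.

```shell
#!/bin/sh
# pipe_copy SRC DESTDIR
# Copy SRC into DESTDIR through a chain of three cat processes connected
# by pipes. The extra cats are deliberate: they stand in for the separate
# rsync processes. (Hypothetical helper; process count is an assumption.)
pipe_copy() {
    cat "$1" | cat | cat > "$2/$(basename "$1")"
}
```

To compare it with the other programs, it would be timed in the same way: sync, drop the caches, then time the copy together with a final sync.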
The kernel plays a role too
On my 4-core AMD Athlon II X4 620 system, all three processes seem to run on the same CPU most of the time. But with help from the taskset command, it is possible to force processes onto specific sets of processors (cores). Suppose the three rsync processes have PIDs 1111, 1112, and 1113; each can be forced onto its own core with:
taskset -pc 0 1111 # force on CPU0
taskset -pc 1 1112 # force on CPU1
taskset -pc 2 1113 # force on CPU2
By using taskset right after rsync was started, the throughput of rsync went up from 36.5MiB/s to 40MiB/s. Though that is a 10% improvement, it was still nowhere near cat's performance. When the three rsync processes were forced to run on the same CPU, performance went down to 32MiB/s.
rsync needs quite a lot of CPU power (both user and system time). Despite that, the ondemand frequency governor does not scale up the CPU frequency. We can force all cores to run at the highest frequency with:
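Typing three taskset commands right after rsync starts is fiddly, so the pinning could be wrapped in a small helper. This is only a sketch: the function name pin_each_on_own_core and the DRY_RUN variable are invented for illustration, and it assumes there are at least as many cores as PIDs.

```shell
#!/bin/sh
# pin_each_on_own_core PID...
# Bind the first PID to core 0, the second to core 1, and so on.
# Set DRY_RUN=1 to print the taskset commands instead of running them.
# (Hypothetical helper; assumes at least as many cores as PIDs.)
pin_each_on_own_core() {
    core=0
    for pid in "$@"; do
        ${DRY_RUN:+echo} taskset -pc "$core" "$pid"
        core=$((core + 1))
    done
}
```

With a running rsync it might be called as `pin_each_on_own_core $(pgrep -x rsync)`. Passing the same core number for every PID instead would give the "all on one CPU" case.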
for i in 0 1 2 3 ; do
echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
done
If the CPU frequency is forced to the highest setting (2.6GHz), the throughput for three rsyncs on a single core goes up to 62MiB/s. Combining this with the "spread the load" tactic using taskset, we even get up to 85MiB/s. That is still 15% less than the other copy programs, but more than a twofold performance increase compared to the default situation.
The conclusion is that in the default situation, using cp instead of rsync will give you almost threefold better performance. However, a little tinkering with the scheduler (using taskset) and the cpufreq governor can more than double rsync's throughput, though it still falls about 15% short of cp.
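The loop above hard-codes four cores. A slightly more general sketch that walks whatever cpuN directories exist might look like this. The function name set_governor and its optional second argument are assumptions made for the example; the optional argument exists only so the function can be tried on a scratch directory instead of the real sysfs tree.

```shell
#!/bin/sh
# set_governor NAME [CPU_ROOT]
# Write NAME into the scaling_governor file of every core found under
# CPU_ROOT (default: the real sysfs location, which needs root to write).
# (Hypothetical helper; CPU_ROOT is only there for trying it out safely.)
set_governor() {
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue    # glob matched nothing, or no cpufreq support
        echo "$1" > "$f"
    done
}
```

Running `set_governor performance` before a test and `set_governor ondemand` afterwards would switch back to the default behavior described in this article.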
Summarizing the results of the test with rsync:
| Throughput | CPUs | Core frequency |
|---|---|---|
| 22MiB/s | 1-3 | 0.8GHz |
| 23MiB/s | 1 | 0.8GHz |
| 34MiB/s | 1 | ondemand |
| 37MiB/s | 1-3 | ondemand (default) |
| 39MiB/s | 3 | 0.8GHz |
| 40MiB/s | 3 | ondemand |
| 62MiB/s | 1 | 2.6GHz |
| 62MiB/s | 1-3 | 2.6GHz |
| 85MiB/s | 3 | 2.6GHz |
In this table, the second column shows how the rsync processes were distributed over the cores. 1 CPU means the three processes were forced onto one single CPU. 1-3 CPUs means the scheduler could do what it saw fit. Finally, 3 CPUs means the three processes were each forced onto their own CPU.
It is clear that the default settings are not the worst settings, but close to it.
The future
An LWN article described problems that the ondemand governor has in choosing the right CPU frequency for processes that do a lot of I/O and also need a lot of CPU power (like rsync). While the processor is waiting for the I/O to finish, the clock frequency is scaled down almost immediately. But when the disk request finishes and the process continues using the CPU, the ondemand governor waits too long before scaling the frequency back up. Arjan van de Ven of the Intel Open Source Technology Centre has made changes to the ondemand governor so that it won't scale the CPU down until the CPU is really idle, and not just waiting for fast I/O.
The bad behavior can be seen using cpufreq_stats. After loading the module:
modprobe cpufreq_stats
it is possible to see how much time each core spent at each frequency. If we look at the results after the rsync command, we see for CPU 2:
$ cat /sys/devices/system/cpu/cpu2/cpufreq/stats/time_in_state
2600000 423293
1900000 363
1400000 534
800000 6645805
The first column is the frequency (in kHz), while the second column is the time spent at that frequency (in 10ms units).
Since the module was loaded, CPU2 has spent most of its time at the lowest frequency, despite the fact that rsync really is quite CPU-intensive.
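The raw counters are easier to judge as seconds and percentages. The following sketch converts them, assuming the two-column format and the 10ms unit described above. The function name summarize_time_in_state is invented for this example.

```shell
#!/bin/sh
# summarize_time_in_state
# Read "frequency_kHz ticks" pairs on stdin and print, for each frequency,
# the time in seconds (assuming 1 tick = 10ms) and its share of the total.
# (Hypothetical helper name.)
summarize_time_in_state() {
    awk '{ khz[NR] = $1; ticks[NR] = $2; total += $2 }
         END {
             for (i = 1; i <= NR; i++)
                 printf "%4.1f GHz %10.2f s %5.1f%%\n",
                        khz[i] / 1000000, ticks[i] / 100,
                        (total ? 100 * ticks[i] / total : 0)
         }'
}

# Example use on a real system:
#   summarize_time_in_state < /sys/devices/system/cpu/cpu2/cpufreq/stats/time_in_state
```

Fed the four lines shown above, it reports that the 0.8GHz state accounts for about 94% of the recorded time and the 2.6GHz state for about 6%.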
After all these results, I decided to give Arjan's patches a try. I compiled kernel version 2.6.35-rc3, which has the patches incorporated, and used that instead of the 2.6.27.41-170.2.117 kernel that Fedora 10 was running when the original problem popped up. For comparison, I also ran the tests with a more recent kernel that does not incorporate Arjan's patches: 2.6.34.
I could immediately see (in atop) that the three rsync processes were on separate processors most of the time. The newer kernels apparently are better at spreading the load. However, this is not a great help:
| FC10 (MiB/s) | 2.6.34 (MiB/s) | 2.6.35-rc3 (MiB/s) | CPUs | Frequency |
|---|---|---|---|---|
| 23.12 | 28.85 | 28.07 | 1 | 0.8GHz |
| 22.19 | 44.23 | 45.25 | 1-3 | 0.8GHz |
| 38.62 | 43.39 | 43.75 | 3 | 0.8GHz |
| 34.01 | 55.48 | 57.37 | 1 | ondemand |
| 36.52 | 44.85 | 45.08 | 1-3 | ondemand (default) |
| 39.73 | 43.65 | 44.30 | 3 | ondemand |
| 62.37 | 66.67 | 68.52 | 1 | 2.6GHz |
| 62.15 | 92.34 | 91.84 | 1-3 | 2.6GHz |
| 85.47 | 89.79 | 89.42 | 3 | 2.6GHz |
Conclusions
One thing is clear: I should upgrade the kernel on my MythTV system. In general, the 2.6.34 and 2.6.35-rc3 kernels give better performance than the old 2.6.27 kernel. But, tinkering or not, rsync still cannot beat a simple cp that copies at over 100MiB/s. Indeed, rsync really needs a lot of CPU power for simple local copies. At the highest frequency, cp needed only 0.34+20.95 seconds of CPU time (user+sys), compared with rsync's 70+55 seconds.
The newer kernels are better at spreading the processes over the cores. However, this hinders Arjan van de Ven's patch from doing its work. The patch does indeed work when all rsync processes run on a single CPU. But because the new kernel does a better job of spreading the processes over CPUs, Arjan's frequency increase does not occur. Arjan is working on an entirely new governor that may be better at raising the CPU's frequency when doing a lot of disk I/O.
| Index entries for this article | |
|---|---|
| GuestArticles | Van Winkel, Jan Christiaan |
