August 18, 2010
This article was contributed by JC van Winkel
The problem
Recently I bought a shiny new disk for my Fedora 10-based MythTV
system. I had to copy some 700GiB of video files from the old disk to
the new one. I am used to rsync for this type of job, as the rsync
command and its accompanying options flow right from my fingers to the
keyboard. However, I was not happy with what I saw: the performance
was nothing to write home about, with the files being copied at about
37MiB/s. Both disks can handle about three times that speed, at least
on the outer cylinders. That makes a lot of difference: an expected
wait of just over two hours turned into a six-hour ordeal. Note that
both SATA disks were local to the system; no network was involved.
Measuring
Wanting to know what happened, I created a small test to see what
was going on: copying a 10GiB file from one disk to the other. I made
sure that the ext4 file systems involved were completely fresh, so
fragmentation could not play a part (a new mkfs after each test). I
also made sure that the test file systems were created on the outermost
(and fastest) cylinders of the disks.
Simply reading the source file could be done at 106MiB/s and
writing a 10GiB file to the destination file system could be done at
134MiB/s.
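Such raw speeds can be measured with dd, for example (a sketch; the
block size and count are arbitrary choices for a 10GiB file, and SRC
and DEST are the variables introduced below):
echo 3 > /proc/sys/vm/drop_caches                            # start with cold caches
dd if=$SRC of=/dev/null bs=1M                                # sequential read speed
dd if=/dev/zero of=$DEST/zeros bs=1M count=10240 conv=fsync  # write speed, flushed to disk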
The copy programs under test were rsync, cpio, cp, and cat. Of course
I took care that the cache could not interfere, by flushing the caches
before each test and waiting for the dirty buffers to be flushed to the
destination disk after the test command completed. For example,
where SRC and DEST are variables holding the name of the source file
in the current directory and the name of the destination directory:
sync # flush dirty buffers to disk
echo 3 > /proc/sys/vm/drop_caches # discard caches
time sh -c "cp $SRC $DEST; sync" # measure cp and sync time
The echo command to /proc/sys/vm/drop_caches forces the invalidation
of all non-dirty buffers in the page cache. To also force dirty
pages to be flushed, we first use the sync command. The copy command
will copy the 10GiB file, but it will actually finish before the last
blocks have been flushed to disk. That is why we time the combination
of the cp command and the sync command, which forces the dirty blocks
out to disk.
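Putting the pieces together, one complete measurement, including the
throughput calculation, can be scripted along these lines (a sketch,
using cp as the example; 10GiB is 10240MiB):
sync                                  # flush dirty buffers to disk
echo 3 > /proc/sys/vm/drop_caches     # discard clean page-cache pages
START=$(date +%s)                     # wall-clock start
sh -c "cp $SRC $DEST; sync"           # command under test plus final flush
END=$(date +%s)                       # wall-clock end
echo "$(( 10240 / (END - START) )) MiB/s"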
The four commands tested were:
rsync $SRC $DEST
echo $SRC | cpio -p $DEST
cp $SRC $DEST
cat $SRC > $DEST/$SRC
The results for rsync, cpio, cp, and cat were:
| user (s) | sys (s) | elapsed (s) | hog | MiB/s | test |
| 5.24 | 77.92 | 101.86 | 81% | 100.53 | cpio |
| 0.85 | 53.77 | 101.12 | 54% | 101.27 | cp |
| 1.73 | 59.47 | 100.84 | 60% | 101.55 | cat |
| 139.69 | 93.50 | 280.40 | 83% | 36.52 | rsync |
The observation that rsync was slow was indeed substantiated. Looking
at the hog factor (the amount of CPU time used relative to the elapsed
time), we can conclude that rsync is not so much disk-bound, as would
be expected, but CPU-bound. That required some more scrutiny. The
atop program showed that rsync uses three processes: one that does
only disk reads, one that does only disk writes, and one (I assume)
control process that uses little CPU time and does no disk I/O.
Using strace, it can be shown that cp uses only read() and write()
system calls in a tight loop, while rsync uses two processes that talk
to each other through a socket, sprinkled with loads of select() system
calls. To simulate the use of multiple processes, I strung several
cat processes together with pipes (a sketch appears below); that test
did not show the bad performance that rsync demonstrates. To test the
influence of using a socket, I also created a TCP service using
xinetd that just starts cat with its output redirected to
a file, to simulate the "network traffic." The client side:
cat $SRC | nc localhost myservice
And the server side:
cat > $DEST
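For those who want to reproduce this, the xinetd service can be
defined along these lines, with a small wrapper script as the server
(the service name, port, and paths are placeholder assumptions):
#!/bin/sh
# /usr/local/bin/receive-file (hypothetical): started by xinetd,
# with the client's socket on stdin
exec cat > /mnt/dest/outfile

# /etc/xinetd.d/myservice (hypothetical); assumes a matching
# "myservice 10000/tcp" line in /etc/services so nc can resolve the name
service myservice
{
    socket_type = stream
    protocol    = tcp
    wait        = no
    user        = root
    server      = /usr/local/bin/receive-file
}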
Even this setup outperforms
rsync. It achieves the same disk bandwidth as
cp with a far lower CPU load than
rsync.
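For reference, the pipe variant mentioned above boiled down to
something like this (a sketch; the number of cat stages shown here is
illustrative):
time sh -c "cat $SRC | cat | cat > $DEST/$SRC; sync"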
The kernel plays a role too
On my 4-core AMD Athlon II X4 620 system, all three processes seem to
run on the same CPU most of the time. But with help from the taskset
command, it is possible to force processes onto specific sets of
processors (cores). Supposing the three rsync processes have PIDs
1111, 1112, and 1113, each can be forced onto its own core by:
taskset -pc 0 1111 # force on CPU0
taskset -pc 1 1112 # force on CPU1
taskset -pc 2 1113 # force on CPU2
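Rather than typing the PIDs by hand, they can be looked up and pinned
in one go; a sketch (assuming exactly three rsync processes are
running and cores 0-2 are available):
i=0
for pid in $(pgrep -x rsync); do
taskset -pc $i $pid   # pin each rsync process to its own core
i=$((i+1))
done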
By using taskset right after rsync was started, the throughput of
rsync went up from 36.5MiB/s to 40MiB/s. Though a 10% improvement, that
was still nowhere near cat's performance. When forcing the three rsync
processes to run on the same CPU, performance went down to 32MiB/s.
rsync needs quite a lot of CPU power (both user and system time).
Despite that, the ondemand frequency governor does not scale up the
CPU frequency. We can force all cores to run at the highest frequency
with:
for i in 0 1 2 3 ; do
echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
done
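Afterward, the default behavior can be restored by writing ondemand
back to the same files:
for i in 0 1 2 3 ; do
echo ondemand > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
done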
If the CPU frequency is forced to the highest value (2.6GHz), the
result for three rsyncs on a single core goes up to 62MiB/s. Combining
this with the "spread the load" tactic using taskset, we even get up
to 85MiB/s. That is still 15% less than the other copy programs, but
more than a two-fold performance increase compared to the default
situation.
The conclusion is that, in the default situation, using cp instead of
rsync will give you almost threefold better performance. However, a
little tinkering with the scheduler (using taskset) and the cpufreq
governor can get you a twofold performance improvement with rsync,
though still only two-thirds of what cp achieves.
Summarizing the results of the test with rsync:
| Throughput | CPUs | Core frequency |
| 22MiB/s | 1-3 | 0.8GHz |
| 23MiB/s | 1 | 0.8GHz |
| 34MiB/s | 1 | ondemand |
| 37MiB/s | 1-3 | ondemand |   << default
| 39MiB/s | 3 | 0.8GHz |
| 40MiB/s | 3 | ondemand |
| 62MiB/s | 1 | 2.6GHz |
| 62MiB/s | 1-3 | 2.6GHz |
| 85MiB/s | 3 | 2.6GHz |
In this table, the second column shows how the rsyncs were distributed
over the cores. "1" means the three rsyncs were all forced onto one
single CPU; "1-3" means the scheduler could do as it saw fit; and "3"
means the three rsyncs were each forced onto their own CPU.
It is clear that the default settings are not the worst possible,
but they come close.
The future
An
LWN article described problems that the
ondemand governor has in choosing the right CPU frequency
for processes that do a lot of I/O and also need a lot of CPU power
(like rsync). While the processor is waiting for the I/O to finish,
the clock frequency is scaled down almost immediately. But when the
disk request finishes, and the process continues using the CPU, the
ondemand governor waits too long before scaling the frequency back up.
Arjan van de Ven of the Intel Open Source Technology Centre has made
changes to the ondemand governor so that it won't scale the CPU down
until the CPU is really idle, not just waiting for fast I/O.
The bad behavior can be seen using cpufreq_stats. After loading
the module:
modprobe cpufreq_stats
it is possible to see how much time each core has spent at each
frequency. If we look at the results after the rsync command, we
see for CPU 2:
$ cat /sys/devices/system/cpu/cpu2/cpufreq/stats/time_in_state
2600000 423293
1900000 363
1400000 534
800000 6645805
The first column is the frequency (in kHz); the second is the time
spent at that frequency (in units of 10ms).
Since the module was loaded, CPU2 has spent most of its time at the
lowest frequency: roughly 18.5 hours at 800MHz, versus about 70
minutes at 2.6GHz, despite the fact that rsync really is quite
CPU-intensive.
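To turn those raw counters into percentages, a short awk one-liner
will do (a sketch; it reads the file twice, first to compute the
total):
f=/sys/devices/system/cpu/cpu2/cpufreq/stats/time_in_state
awk 'NR==FNR { total += $2; next }
     { printf "%4.1f GHz: %5.1f%%\n", $1/1e6, 100*$2/total }' $f $f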
After all these results, I decided to give Arjan's patches a try.
I compiled kernel version 2.6.35-rc3, which has the patches
incorporated, and used it instead of the 2.6.27.41-170.2.117 kernel
that Fedora 10 was running when the original problem popped up. For
comparison, I also ran the tests with a more recent kernel that does
not incorporate Arjan's patches: 2.6.34.
I could immediately see (in atop) that the three rsync processes were
on separate processors most of the time. The newer kernels apparently
are better at spreading the load. However, this is not a great help:
| FC10 MiB/s | 2.6.34 MiB/s | 2.6.35-rc3 MiB/s | CPUs | Frequency |
| 23.12 | 28.85 | 28.07 | 1 | 0.8GHz |
| 22.19 | 44.23 | 45.25 | 1-3 | 0.8GHz |
| 38.62 | 43.39 | 43.75 | 3 | 0.8GHz |
| 34.01 | 55.48 | 57.37 | 1 | ondemand |
| 36.52 | 44.85 | 45.08 | 1-3 | ondemand |   << default
| 39.73 | 43.65 | 44.30 | 3 | ondemand |
| 62.37 | 66.67 | 68.52 | 1 | 2.6GHz |
| 62.15 | 92.34 | 91.84 | 1-3 | 2.6GHz |
| 85.47 | 89.79 | 89.42 | 3 | 2.6GHz |
Conclusions
One thing is clear: I should upgrade the kernel on my MythTV system.
In general, the 2.6.34 and 2.6.35-rc3 kernels give better performance
than the old 2.6.27 kernel. But, tinkering or not, rsync still cannot
beat a simple cp, which copies at over 100MiB/s. Indeed, rsync really
needs a lot of CPU power for simple local copies: at the highest
frequency, cp only needed 0.34+20.95 seconds of CPU time, compared
with rsync's 70+55 seconds.
The newer kernels are better at spreading the processes over the
cores. However, this hinders Arjan van de Ven's patch from doing its
work. The patch does indeed work when all rsync processes
run on a single CPU. But because the new kernel does a better job of
spreading the processes over CPUs, Arjan's frequency increase does not
occur. Arjan is working on an entirely new governor that
may be better at raising the CPU's frequency when doing a lot of disk
I/O.