March 17, 2010
This article was contributed by Mel Gorman
[Editor's note: this is part 4 of Mel Gorman's series on support for huge pages in Linux. Parts 1, 2, and 3 are available for those who have not read them yet.]
In this installment, a small number of benchmarks are configured to use huge
pages: STREAM, sysbench, SpecCPU 2006, and SpecJVM. In doing so, we show that
utilising huge pages is a lot easier than it was in the past. In all cases,
there is a heavy reliance on the hugeadm utility to simplify machine
configuration and on hugectl to configure libhugetlbfs.
STREAM is a memory-intensive benchmark and, while its reference pattern
has poor spatial and temporal locality, it can benefit from reduced TLB
misses. Sysbench is a simple OnLine Transaction Processing
(OLTP) benchmark that can use Oracle, MySQL, or PostgreSQL as a database
backend. While there are better OLTP benchmarks out there, Sysbench is
very simple to set up and reasonable for illustration. SpecCPU 2006 is a
computational benchmark of interest to high-performance computing (HPC),
and SpecJVM benchmarks basic classes of Java applications.
1 Machine Configuration
The machine used for this study is a Terrasoft Powerstation described in
the table below.
| Architecture        | PPC64                      |
| CPU                 | PPC970MP with altivec      |
| CPU Frequency       | 2.5GHz                     |
| # Physical CPUs     | 2 (4 cores)                |
| L1 Cache per core   | 32K Data, 64K Instruction  |
| L2 Cache per core   | 1024K Unified              |
| L3 Cache per socket | N/A                        |
| Main Memory         | 8 GB                       |
| Mainboard           | Machine model specific     |
| Superpage Size      | 16MB                       |
| Machine Model       | Terrasoft Powerstation     |
Configuring the system for use with
huge pages was a simple matter of performing the following commands.
$ hugeadm --create-global-mounts
$ hugeadm --pool-pages-max DEFAULT:8G
$ hugeadm --set-recommended-min_free_kbytes
$ hugeadm --set-recommended-shmmax
$ hugeadm --pool-pages-min DEFAULT:2048MB
$ hugeadm --pool-pages-max DEFAULT:8192MB
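The pool configured above can be sanity-checked before running any benchmarks. The commands below were not part of the study and their output depends on the architecture and pool sizes in use, but they are one way to confirm that the pages were actually allocated.
# Show the minimum, current, and maximum size of each huge page pool
$ hugeadm --pool-list
# The kernel's own counters report the same information
$ grep Huge /proc/meminfo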
2 STREAM
STREAM [mccalpin07] is a synthetic memory bandwidth benchmark that
measures the performance of four long vector operations: Copy, Scale, Add, and
Triad. It can be used to calculate the number of floating-point operations
that can be performed in the time taken for an average memory access.
Simplistically, more bandwidth is better.
The C version of the benchmark was selected and used three statically
allocated arrays for calculations. Modified versions of the benchmark
using malloc() and get_hugepage_region() were found
to have similar performance characteristics.
The benchmark has two parameters: N, the number of elements in each array, and
OFFSET, the number of elements padding the end of each array. A range of
values for N was used to generate workloads between 128K and 3GB in size;
as the benchmark operates on three statically allocated arrays of
double-precision elements, the total working set is roughly 3*N*8 bytes.
For each size of N chosen, the benchmark was run 10 times and an average
taken. The benchmark is sensitive to cache placement, and the optimal layout
varies between architectures; where the standard deviation of the 10 iterations
exceeded 5% of the throughput, OFFSET was increased to add one cache line of
padding between the arrays, and the benchmark was rerun for that value of N.
High standard deviations were only observed when the total working set was
around the size of the L1 cache, the L2 cache, or all caches combined.
The benchmark avoids data re-use, be it in registers or in the cache. Hence,
any benefit from huge pages would be due to fewer page faults, a slight
reduction in TLB misses as fewer TLB entries are needed to cover the working
set, and an increase in available cache as less translation information needs
to be stored.
To use huge pages, the benchmark was first compiled with the
libhugetlbfs ld wrapper to align the text and data sections to
a huge page boundary [libhtlb09] such as in the following example.
$ gcc -DN=1864135 -DOFFSET=0 -O2 -m64 \
-B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align \
-Wl,--library-path=/usr/lib \
-Wl,--library-path=/usr/lib64 \
-lhugetlbfs stream.c \
-o stream
# Test launch of benchmark
$ hugectl --text --data --no-preload ./stream
This page contains plots
showing the performance
results for a range of sizes running on the test machine; one of them
appears to the right.
Performance
improvements range from 11.6% to 16.59% depending on the operation
in use. Performance improvements would be typically lower for an X86 or
X86-64 machine, likely in the 0% to 4% range.
3 SysBench
SysBench is an OnLine Transaction Processing
(OLTP) benchmark representing a general class of workload where clients
perform a sequence of operations whose end result must appear to be an
indivisible operation. TPC-C
is considered an industry standard for the evaluation of OLTP, but it
requires significant capital investment and is extremely complex to set up.
SysBench is a system
performance benchmark comprising file I/O, scheduler, memory allocation, and
threading tests, and it includes an OLTP benchmark. The setup requirements are
less complicated, and SysBench works with MySQL, PostgreSQL, and Oracle
databases.
PostgreSQL was used for this
experiment on the grounds that it uses a shared memory segment in a manner
similar to Oracle, making it a meaningful comparison with a commercial database
server. Sysbench 0.4.12 and Postgres 8.4.0 were built from source.
Postgres was configured to use a 756MB shared buffer and an effective cache
of 150MB, and a maximum of 6*NR_CPUS clients were allowed to
connect. Note that the maximum number of clients allowed is greater than the
number of clients used in the test; a typical configuration would allow more
connections than the expected number of clients so that administrative
processes can still connect. The update_process_title
parameter was turned off as a small optimisation. Options that checkpoint,
fsync, log, or synchronise were turned off to avoid interference from I/O.
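For reference, the description above corresponds roughly to the postgresql.conf fragment below. The exact set of I/O-related options that were disabled is not listed, so the last three lines are an assumption based on the parameters available in Postgres 8.4, and the max_connections value assumes the four cores of the test machine.
shared_buffers = 756MB          # shared memory segment to be backed by huge pages
effective_cache_size = 150MB
max_connections = 24            # 6 * NR_CPUS with four cores
update_process_title = off      # small optimisation
fsync = off                     # assumed: avoid interference from I/O
synchronous_commit = off        # assumed
full_page_writes = off          # assumed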
The system was configured to allow the postgres user to use
huge pages with shmget() as described in part 3.
Postgres uses System V shared memory so pg_ctl
was invoked as follows.
$ hugectl --shm bin/pg_ctl -D `pwd`/data -l logfile start
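To confirm that the huge-page-backed segment was actually created, the kernel's counters and the System V IPC tables can be inspected after the server starts; this check was not part of the original procedure.
# HugePages_Free and HugePages_Rsvd change once the segment exists
$ grep Huge /proc/meminfo
# Lists the System V shared memory segment created by Postgres
$ ipcs -m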
For the test itself, the table size was 10 million rows, the test was
read-only to avoid I/O, and the test type was complex, meaning that each
operation performed by a client is a full database transaction. Tests were run
varying the number of clients accessing the database from one to four times
the number of CPU cores in the system. For each thread count, the test was run
multiple times until at least five iterations completed with a confidence
level of 99% that the estimated mean is within 2% of the true mean. In
practice, the initial iteration was discarded due to the increased I/O and
faults incurred during the first run.
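The exact sysbench command lines are not reproduced here, but with sysbench 0.4.12 a complex, read-only OLTP run of this kind would be invoked roughly as follows; the database connection options are omitted and the thread count shown is just an example.
# One-off step: create and populate the 10-million-row test table
$ sysbench --test=oltp --db-driver=pgsql --oltp-table-size=10000000 prepare
# Read-only run in complex (transactional) mode with 4 client threads
$ sysbench --test=oltp --db-driver=pgsql --oltp-table-size=10000000 \
  --oltp-read-only=on --oltp-test-mode=complex --num-threads=4 run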
The plot to the right (click for larger version) shows the performance
results for different numbers of threads, with performance improvements
ranging from 1% to 3.5%. Unlike STREAM, the performance improvements
would tend to be similar on X86 and X86-64 machines running this particular
test configuration. The exact reasoning is beyond the scope of this article,
but it comes down to the fact that STREAM exhibits very poor locality of
reference, making cache behaviour a significant factor in its performance. As
typical workloads have a greater degree of reference locality than STREAM,
the expectation is that their performance gains would be similar across
different architectures.
4 SpecCPU 2006
SpecCPU 2006 v1.1 is a
standardised CPU-intensive benchmark used in evaluations for HPC that also
stresses the memory subsystem. A --reportable run was made
comprising test, train, and three ref sets of input data. Three
sets of runs compare base pages, huge pages backing just the heap, and
huge pages backing text, data, and the heap. Only base tuning was used
with no special compile options other than what was required to compile
the tests.
To back the heap using huge pages, the tests were run with:
hugectl --heap runspec ...
To also back the text and data, the SPEC configuration file was modified to
build the benchmarks with the libhugetlbfs alignment, similar to STREAM as
described above, and the --text --data --bss switches were also specified to
hugectl, as shown below.
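Combined, the second huge-page configuration amounts to an invocation along the following lines, with the remaining runspec arguments unchanged from a normal run.
$ hugectl --text --data --bss --heap runspec ...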
This plot shows the performance results of running the integer SpecCPU tests
(click for full size and for the floating-point test results). As is clear,
there are very large fluctuations depending on the reference pattern of the
workload, but many of the improvements are quite significant, averaging around
13% for the integer benchmarks and 7-8% for the floating-point benchmarks. An
interesting point to note is that, for the Fortran applications, performance
gains were similar whether the heap alone or the text and data were also
backed by huge pages. This heavily implies that the Fortran applications were
using dynamic allocation; for older Fortran applications that statically
allocate their arrays, relinking to back the text and data with huge pages may
be required to see any performance gains.
5 SpecJVM (JVM/General)
Java is used in an increasing number of scenarios, including real
time systems, and it dominates in the execution of business-logic
related applications. Particularly within application servers, the
Java Virtual Machine (JVM) uses large quantities of virtual
address space that can benefit from being backed by huge pages. SpecJVM 2008 is a benchmark
suite for Java Runtime Environments (JRE). According to the
documentation, the intention is to reflect the performance of the processor
and memory system with a low dependence on file or network I/O. Crucially for
HPC, it includes SCIMark,
which is a Java benchmark for scientific and numerical computing.
The 64-bit version of IBM Java Standard Edition Version 6 SP 3 was used, but
support for huge pages is available in other JVMs. The JVM was configured
to use a maximum of 756MB for the heap. Unlike the other benchmarks, the JVM
is huge-page-aware and uses huge-page-backed shared memory segments when
-Xlp is specified. An example invocation of the benchmark is as
follows.
$ java -Xlp -Xmx756m -jar SPECjvm2008.jar 120 300 --parseJvmArgs -i 1 --peak
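The -Xlp switch shown above is how the IBM JVM enables huge pages; other JVMs have equivalent options. For example, Sun's HotSpot JVM uses -XX:+UseLargePages, as in the purely illustrative invocation below, which was not part of this study.
$ java -XX:+UseLargePages -Xmx756m -jar SPECjvm2008.jar ...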
This plot shows the performance results of running the full range of SpecJVM
tests. The results are interesting as they show that the performance gains
were not universal, with the serial benchmark performing spectacularly poorly.
Despite this, performance improved on average by 4.43%, with very minimal work
required on the part of the administrator.
6 Summary
In this installment, it was shown that, with a minimal amount of additional
work, huge pages can easily be used to improve benchmark performance. For the
database and JVM benchmarks, the same configurations could just as easily be
applied to a real-world deployment as to a benchmarking situation. For the
other benchmarks, the effort can be hidden with minimal use of initialisation
scripts. Using huge pages on Linux was a tricky affair in the past, but these
examples show that this is no longer the case.