
Huge pages part 4: benchmarking with huge pages

March 17, 2010

This article was contributed by Mel Gorman

[Editor's note: this is part 4 of Mel Gorman's series on support for huge pages in Linux. Parts 1, 2, and 3 are available for those who have not read them yet.]

In this installment, a small number of benchmarks are configured to use huge pages: STREAM, SysBench, SpecCPU 2006, and SpecJVM. In doing so, we show that utilising huge pages is a lot easier than it was in the past. In all cases, there is a heavy reliance on hugeadm to simplify machine configuration and on hugectl to configure libhugetlbfs.

STREAM is a memory-intensive benchmark and, while its reference pattern has poor spatial and temporal locality, it can benefit from reduced TLB references. SysBench is a simple OnLine Transaction Processing (OLTP) benchmark that can use Oracle, MySQL, or PostgreSQL as a database backend. While there are better OLTP benchmarks out there, SysBench is very simple to set up and reasonable for illustration. SpecCPU 2006 is a computational benchmark of interest to high-performance computing (HPC), and SpecJVM benchmarks basic classes of Java applications.

1 Machine Configuration

The machine used for this study is a Terrasoft Powerstation described in the table below.

    Architecture          PPC64
    CPU                   PPC970MP with AltiVec
    CPU frequency         2.5GHz
    Physical CPUs         2 (4 cores)
    L1 cache per core     32K data, 64K instruction
    L2 cache per core     1024K unified
    L3 cache per socket   N/A
    Main memory           8GB
    Mainboard             machine model specific
    Superpage size        16MB
    Machine model         Terrasoft Powerstation

Configuring the system for use with huge pages was a simple matter of performing the following commands.

    # Create hugetlbfs mount points usable by all users, one per page size
    $ hugeadm --create-global-mounts

    # Allow the default huge page pool to grow to 8GB
    $ hugeadm --pool-pages-max DEFAULT:8G

    # Tune vm.min_free_kbytes and kernel.shmmax to recommended values
    $ hugeadm --set-recommended-min_free_kbytes
    $ hugeadm --set-recommended-shmmax

    # Statically reserve 2GB of huge pages, keeping the 8GB upper limit
    $ hugeadm --pool-pages-min DEFAULT:2048MB
    $ hugeadm --pool-pages-max DEFAULT:8192MB
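
The resulting configuration can be verified with hugeadm's pool listing, which reports each pool's minimum, current, and maximum sizes:

    $ hugeadm --pool-list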

2 STREAM

STREAM [mccalpin07] is a synthetic memory bandwidth benchmark that measures the performance of four long vector operations: Copy, Scale, Add, and Triad. It can be used to calculate the number of floating point operations that can be performed during the time for the “average” memory access. Simplistically, more bandwidth is better.

The C version of the benchmark was selected; it uses three statically allocated arrays for its calculations. Modified versions of the benchmark using malloc() and get_hugepage_region() were found to have similar performance characteristics.
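
For illustration, here is a minimal sketch of what the get_hugepage_region() variant might look like, using the libhugetlbfs API described in part 3; the array length, flags, and error handling are illustrative rather than taken from the actual modified benchmark:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hugetlbfs.h>

    #define N 2000000    /* illustrative array length */

    int main(void)
    {
        /* Back each STREAM array with huge pages; the default flags
         * fall back to base pages if the pool cannot satisfy the
         * request. */
        double *a = get_hugepage_region(N * sizeof(double), GHR_DEFAULT);
        double *b = get_hugepage_region(N * sizeof(double), GHR_DEFAULT);
        double *c = get_hugepage_region(N * sizeof(double), GHR_DEFAULT);

        if (!a || !b || !c) {
            perror("get_hugepage_region");
            return EXIT_FAILURE;
        }

        /* ... the Copy, Scale, Add and Triad kernels operate on a, b, c ... */

        free_hugepage_region(a);
        free_hugepage_region(b);
        free_hugepage_region(c);
        return 0;
    }

A program like this links with -lhugetlbfs but needs neither the linker wrapper nor hugectl, as only the allocated regions are backed by huge pages.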

The benchmark has two parameters: N, the size of the array, and OFFSET, the number of elements padding the end of the array. A range of values for N were used to generate workloads between 128K and 3GB in size. For each size of N chosen, the benchmark was run 10 times and an average taken. The benchmark is sensitive to cache placement and optimal layout varies between architectures; where the standard deviation of 10 iterations exceeded 5% of the throughput, OFFSET was increased to add one cache-line of padding between the arrays and the benchmark for that value of N was rerun. High standard deviations were only observed when the total working set was around the size of the L1, L2 or all caches combined.

The benchmark avoids data re-use, be it in registers or in the cache. Hence, any benefit from huge pages would be due to fewer page faults, a slight reduction in TLB misses (as fewer TLB entries are needed to cover the working set), and an increase in available cache, as less translation information needs to be stored.
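
To put the TLB point in perspective: assuming 4KB base pages, mapping a 3GB working set takes 786,432 base-page translations, but only 192 of this machine's 16MB huge pages.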

To use huge pages, the benchmark was first compiled with the libhugetlbfs ld wrapper to align the text and data sections to a huge page boundary [libhtlb09], as in the following example.

   $ gcc -DN=1864135 -DOFFSET=0 -O2 -m64                     \
        -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align     \
        -Wl,--library-path=/usr/lib                          \
        -Wl,--library-path=/usr/lib64                        \
        -lhugetlbfs stream.c                                 \
        -o stream

   # Test launch of benchmark
   $ hugectl --text --data --no-preload ./stream
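
Whether the segments were actually remapped can be confirmed by watching the huge page pool counters; HugePages_Free in /proc/meminfo should drop while the benchmark runs.

   $ grep Huge /proc/meminfo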

[STREAM benchmark results]

The plots show the performance results for a range of sizes running on the test machine. Performance improvements range from 11.6% to 16.59% depending on the operation in use. Improvements would typically be lower on an x86 or x86-64 machine, likely in the 0% to 4% range.

3 SysBench

SysBench is a system performance benchmark comprising file I/O, scheduler, memory allocation, and threading tests, as well as an OLTP benchmark representing a general class of workload where clients perform a sequence of operations whose end result must appear to be indivisible. TPC-C is considered the industry standard for the evaluation of OLTP, but it requires significant capital investment and is extremely complex to set up. SysBench's setup requirements are far less complicated, and it works with MySQL, PostgreSQL, and Oracle databases.

PostgreSQL was used for this experiment on the grounds that it uses a shared memory segment similar to Oracle, making it a meaningful comparison with a commercial database server. Sysbench 0.4.12 and Postgres 8.4.0 were built from source.

Postgres was configured to use a 756MB shared buffer and an effective cache size of 150MB, and a maximum of 6*NR_CPUs clients was allowed to connect. Note that the maximum number of clients allowed is greater than the number of clients used in the test; a typical configuration allows more connections than the expected number of clients so that administrative processes can still connect. The update_process_title parameter was turned off as a small optimisation. Options that checkpoint, fsync, log, or synchronise were turned off to avoid interference from I/O. The system was configured to allow the postgres user to use huge pages with shmget() as described in part 3. Postgres uses System V shared memory, so pg_ctl was invoked as follows.

   $ hugectl --shm bin/pg_ctl -D `pwd`/data -l logfile start
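
For reference, the settings described above map onto postgresql.conf entries along the following lines; the exact set of I/O-related options disabled is an assumption based on the description, and 24 connections corresponds to 6*NR_CPUs on the four-core test machine:

    shared_buffers = 756MB           # the shared segment backed by huge pages
    effective_cache_size = 150MB
    max_connections = 24             # 6 * NR_CPUs
    update_process_title = off       # small optimisation
    fsync = off                      # avoid I/O interference; the exact
    synchronous_commit = off         # set of disabled options is an
    full_page_writes = off           # assumption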

For the test itself, the table size was 10 million rows, the test was read-only to avoid I/O, and the test type was “complex”, meaning each operation by the client is a database transaction. Tests were run varying the number of clients accessing the database from one to four times the number of CPU cores in the system. For each thread count, the test was run multiple times until at least five iterations completed with a confidence level of 99% that the estimated mean was within 2% of the true mean. In practice, the initial iteration was discarded due to the increased I/O and faults incurred during the first run.
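
A SysBench 0.4 invocation matching that description would look something like the following; the database connection options are omitted for brevity:

    # Populate the 10-million-row table, then run the read-only complex test
    $ sysbench --test=oltp --db-driver=pgsql --oltp-table-size=10000000 prepare
    $ sysbench --test=oltp --db-driver=pgsql --oltp-table-size=10000000 \
          --oltp-read-only=on --oltp-test-mode=complex \
          --num-threads=4 run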

[SysBench benchmark results]

The plot shows the performance results for different numbers of threads, with improvements in the 1% to 3.5% range. Unlike STREAM, the performance improvements would tend to be similar on x86 and x86-64 machines running this particular test configuration. The exact reasoning is beyond the scope of this article, but it comes down to the fact that STREAM exhibits very poor locality of reference, making cache behaviour a significant factor in its performance. As real workloads typically have a greater degree of reference locality than STREAM, the expectation is that the performance gains would be similar across architectures.

4 SpecCPU 2006

SpecCPU 2006 v1.1 is a standardised CPU-intensive benchmark used in evaluations for HPC that also stresses the memory subsystem. A --reportable run was made comprising “test”, “train”, and three “ref” sets of input data. Three sets of runs compare base pages, huge pages backing just the heap, and huge pages backing text, data, and the heap. Only base tuning was used with no special compile options other than what was required to compile the tests.

To back the heap using huge pages, the tests were run with:

    hugectl --heap runspec ...

To also back the text and data, the SPEC configuration file was modified to build the benchmarks with the libhugetlbfs linker options used for STREAM above, and the --text --data --bss switches were then also specified to hugectl, as sketched below.
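
Concretely, that amounts to something like the following; EXTRA_LDFLAGS and EXTRA_LIBS are the usual SPEC configuration variables, but the exact placement depends on how the config file is structured:

    # In the SPEC configuration file:
    EXTRA_LDFLAGS = -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align
    EXTRA_LIBS    = -lhugetlbfs

    # Then run with text, data, BSS, and heap backed by huge pages:
    $ hugectl --text --data --bss --heap runspec ...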

[SpecCPU benchmark results]

The plots show the performance results for the integer and floating-point SpecCPU tests. There are clearly large fluctuations depending on the reference pattern of each workload, but many of the improvements are significant, averaging around 13% for the integer benchmarks and 7-8% for the floating-point benchmarks. An interesting point is that, for the Fortran applications, performance gains were similar whether the text and data or just the heap was backed by huge pages, which strongly implies that those applications use dynamic allocation. For older Fortran applications that allocate statically, relinking to back the text and data with huge pages may be required to see any performance gains.

5 SpecJVM (JVM/General)

Java is used in an increasing number of scenarios, including real-time systems, and it dominates the execution of business-logic applications. Within application servers in particular, the Java Virtual Machine (JVM) uses large quantities of virtual address space that can benefit from being backed by huge pages. SpecJVM 2008 is a benchmark suite for Java Runtime Environments (JREs). According to its documentation, the intention is to reflect the performance of the processor and memory system with a low dependence on file or network I/O. Crucially for HPC, it includes SCIMark, a Java benchmark for scientific and numerical computing.

The 64-bit version of IBM Java Standard Edition Version 6 SP 3 was used, but support for huge pages is available in other JVMs. The JVM was configured to use a maximum of 756MB for the heap. Unlike the other benchmarks, the JVM is huge-page-aware and uses huge-page-backed shared memory segments when -Xlp is specified. An example invocation of the benchmark is as follows.

   $ java -Xlp -Xmx756m -jar SPECjvm2008.jar 120 300 --parseJvmArgs -i 1 --peak
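
Huge page support is not specific to IBM's JVM. Sun's HotSpot, for example, enables it with the -XX:+UseLargePages option, shown here with the benchmark arguments elided:

   $ java -XX:+UseLargePages -Xmx756m -jar SPECjvm2008.jar ...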

[SpecJVM benchmark results]

The plot shows the performance results for the full range of SpecJVM tests. The results are interesting in that the gains were not universal, with the serial benchmark performing spectacularly poorly. Despite this, performance improved by an average of 4.43% with very little work required of the administrator.

6 Summary

In this installment, it was shown that, with a minimal amount of additional work, huge pages can easily be used to improve benchmark performance. For the database and JVM benchmarks, the same configurations could be applied to a real-world deployment just as easily as to a benchmarking situation. For the other benchmarks, the effort can be hidden with minimal use of initialisation scripts. Using huge pages on Linux was a tricky affair in the past, but these examples show that this is no longer the case.



Huge pages part 4: benchmarking with huge pages

Posted Apr 8, 2013 23:03 UTC (Mon) by tdaitx (subscriber, #56964)

I had to use a slightly different specjvm2008 invocation:
    $ java -Xlp -Xmx756m -jar SPECjvm2008.jar -wt 120 -it 300 --parseJvmArgs -i 1 --peak
And for those trying to install specjvm2008 from the console (tip stolen from [1]):
    $ java -jar SPECjvm2008_1_01_setup.jar -i console
[1] http://stackoverflow.com/questions/9257835/cant-install-specjvm2008

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds