HPL Benchmark Test Implementation on Georgia Tech PACE-ICE Cluster

10 Nov 2019

Reading time ~12 minutes

HPLinpack: Dense Linear Algebra Test on PACE-ICE Cluster

HPL - A Portable Implementation of the High-Performance Linpack, Benchmark for Distributed-Memory Computers
And the versioning is:
HPL 2.3 - by Antoine Petitet, Clint Whaley, Jack Dongar

HPL is a High-Performance Linpack benchmark implementation.

HPL is written in a portable ANSI C and requires an MPI implementation as well as either BLAS or VSIPL library. Such choice of software dependencies gives HPL both portability and performance.

The HPL package provides a testing and timing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. The best performance achievable by this software on your system depends on a large variety of factors. Nonetheless, with some restrictive assumptions on the interconnection network, the algorithm described here and its attached implementation are scalable in the sense that their parallel efficiency is maintained constant with respect to the per processor memory usage.

In our implementations for this project,

All the following packages are included in the module mkl/19.0.

HPL rely on an efficient implementation of the Basic Linear Algebra Subprograms (BLAS) (http://www.netlib.org/blas/blas-3.8.0.tgz)
And MPI module we already had as mvapich2 2.3, a MPI based software package, alternative versionings can be found at (http://mvapich.cse.ohio-state.edu/downloads/)

Problems to be solved:

The code solves a uniformely random system of linear equations and reports time and floating-point execution rate using a standard formula for operation count. Specifically, HPL generates a linear system of equations of order n and solves it using LU decomposition with partial row pivoting.
For N Problem size, it will need (2/3 * N^3－2N^2) steps to compute, and it measure the computational time performing such (2/3 * N^3－2N^2) steps operations. And then calculated the FLOPs.
Since HPL performs computation on an N x N array of Double Precision (DP) elements, and that each double precision element requires sizeof(double) = 8 bytes, the memory consumed for a problem size of N is 8N^2.
NB=192 for the broadwell processors.
P and Q, knowing that the product P x Q SHOULD typically be equal to the number of MPI processes.

Here are the parameters:

T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

Values of the benchmark:

The HPL benchmark is used as reference benchmark to provide data for the Top500 list and thus rank to supercomputers worldwide.It has the testing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. Generally It reports the computation time per node, and the FLOPS rate.

Nodes on pace-ice cluster intended to use:

We chose the nodes on pace-ice clusters with the CPU only.
They are the ones with 28 CPU cores (14 physical cores), Model is Intel(R) Xeon(R) CPU E5-2680 v4 @2.40GHz machines.
We intended to perform benchmark tests on one node and four nodes.

The balance of being to specific to one application but very predictive, versus being general to lots of applications while being too simple to predict the performance of any application very well:

The HPLinpack (HPL) as a scalable addition to the original Linpack benchmark suites, due to its parallel computation efficiency remaining constant respect to per process memory usage, succeeded in scalability. And it outputs only one single number in time and FLOPS, making this benchmark results are easily compared with different machines.
In this sense, the benchmark is very good at being general to lots of applications.
However, it only tests the dense linear system, omitted a fact that there are many more other operations can be taken into account to the performance evaluation. Thus, the best results can be achieved by fine tuning for a particular machine targeting to solving the dense linear system, which caused the results shown are biased.
That is to say, the benchmark is too simple for all kinds of applications testing, but can be very predictive for machines or applications specificially for dense linear system problem solving.

Source code:

The source code can be obtained from (http://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz) as a tar, and the dependency packages are BLAS (http://www.netlib.org/blas/blas-3.8.0.tgz) and mpi (https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.gz). However, in practice, we directly loaded the module installed on PACE. I have documented a step by step workflow for how to successfully install and run the benchmark.

Dependencies and Enviroment

The benchmark needs to be run on a computing node, the following modules need to be loaded:

module use $CSE6230_DIR/modulefiles
module load cse6230/core
module load cse6230/mkl/19.0

The benchmark can be made locally once downloaded, and it required MPI packages to successfully run. In our case, we already have the MPI software installed.

Workflow

Step 1 Benchmark installation:

Create a directory named “hpl” under “$CSE6230_DIR”/final, to contain all the neccessary files for the benchmark.

About Downloading the packages:

hpl-2.3.tar.gz can be downloaded at (http://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz)
the benchmark tool is hpl-2.3, but blas will be one of the neccessary package to successfully make the executionables BLAS. In our project, the linear algebra package is contained with the intel mkl module, we do not neccessarily install it from scratch.

About building the tool packages:

Load Modules:

It is always needed to load the neccessary modules first, here to ease the touble, we always load :

module load cse6230/core
module load mkl/19.0

HPL:

After downloading and uploads the packages to the final/hpl folder:

mguo34@coc-ice:~/cse6230/final/hpl$ tar -xzvf hpl-2.3.tar.gz

mguo34@coc-ice:~/cse6230/final/hpl$ cd hpl-2.3/setup

To generate a template:

mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3/setup$ sh make_generic

Then have our make file named as Make.Linux, and move it to the higher level directory

mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3/setup$ cp Make.UNKNOWN ../Make.Linux

And go back to hpl/hpl-2.3/ directory
mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3/setup$ cd ..

Now we have a Make.Linux template to work on.

mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ ls
acinclude.m4  BUGS          config.sub    COPYRIGHT  INSTALL     Makefile.am  Make.top  README   THANKS
aclocal.m4    ChangeLog     configure     depcomp    install-sh  Makefile.in  man       setup    TODO
AUTHORS       compile       configure.ac  HISTORY    lib         Make.Linux   missing   src      TUNING
bin           config.guess  COPYING       include    Makefile    makes        NEWS      testing  www

Open with any editors, it should be showing a template, yet not completed, we need add dependecy patch to make sure the tool can be successfully built. it will need mvapich2, BLAS, and the hpl

mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ vi Make.Linux

Linear Algebra Package:

The linear algebra package is already contained within the mkl/19.0 module, we will simply load this module, and modify the dependecy in Make.Linux file, like the description below:

MODIFY MAKE FILE:

After successfully get all the neccessary files, we modified package paths and complier/linkers choice in Make.Linux config file:

TOPdir       = $(HOME)/cse6230-hw/final/hpl/hpl-2.3
MPdir        = /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109
MPinc        = -I $(MPdir)/include64
MPlib        = $(MPdir)/intel64/lib/libmpi_mt.so
Adir        = $(MKLROOT)
ifndef  LAinc
LAinc        = -I $(LAdir)/include/
endif
ifndef  LAlib
LAlib        = -L$(LAdir)/lib/intel64 \
               -Wl,--start-group \
               $(LAdir)/lib/intel64/libmkl_intel_lp64.a \
               $(LAdir)/lib/intel64/libmkl_intel_thread.a \
               $(LAdir)/lib/intel64/libmkl_core.a \
               -Wl,--end-group -lpthread -ldl
endif

CC       = mpiicc
CCNOOPT  = $(HPL_DEFS)
OMP_DEFS = -qopenmp
CCFLAGS  = $(HPL_DEFS) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk
LINKER       = $(CC)
LINKFLAGS    = $(CCFLAGS) $(OMP_DEFS) -mt_mpi

Compile into executable:

Now go to the hpl-2.3 directory and start compiling:
First Make clean:
mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ make arch=Linux clean_arch_all
Then Make:
mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ make arch=Linux

Step 2 load modules and run a test:

module load cse6230/core
module load mkl/19.0

CPU=$(cat /proc/cpuinfo | grep "model name" | tail -1)
COUNT=$(cat /proc/cpuinfo | grep processor | wc -l)
MEM=$(cat /proc/meminfo |grep MemTotal | tail -1)
echo "CPU : $CPU"
echo "Total CPU Cores : $COUNT"
echo "$MEM"

CPU : model name	: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Total CPU Cores : 28
MemTotal:       132176424 kB

According to HPL suggestion, a good N should makes the program use 90% of total memory, which is $\sqrt{(130844220*1024/8)}*0.9=116,472$. However, for testrun purpose, it takes too long to compute. So we selected a N=19200 and N=38400, which is about ten and five times smaller than that. For Block size, we also followed the recommendation to choose a value in the [32 , 256] interval. In our example, we selected 64 and 192. Because the example test was performed on one node (chip), each core can be considered closely connected. Hence, 4 by 7 grid was set.

When submitting to compute node with a script ("$CSE6230_DIR"/final/script/hpl_singleNode.sh**), we tested from 19200 to 153600 to see how different size (Ns) affect the performance. **Example job submission script can be found in “$CSE6230_DIR”/final/script/hpl_example.sh`.

# testrun (for simplicity, only partial test is shown below)
cd /nv/coc-ice/zjiang333/cse6230-hw/final/hpl/hpl-2.3/bin/Linux/
mpirun -np 28 ./xhpl 

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   19200    38400 
NB     :      64      192 
PMAP   : Row-major process mapping
P      :       4 
Q      :       7 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               2.220446e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       19200    64     4     7               8.38             5.6287e+02
HPL_pdgesv() start time Wed Dec 11 15:32:05 2019

HPL_pdgesv() end time   Wed Dec 11 15:32:14 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.57135916e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       19200   192     4     7               6.31             7.4785e+02
HPL_pdgesv() start time Wed Dec 11 15:32:14 2019

HPL_pdgesv() end time   Wed Dec 11 15:32:21 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.73935730e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       38400    64     4     7              60.83             6.2057e+02
HPL_pdgesv() start time Wed Dec 11 15:32:22 2019

HPL_pdgesv() end time   Wed Dec 11 15:33:23 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.37495714e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       38400   192     4     7              43.81             8.6164e+02
HPL_pdgesv() start time Wed Dec 11 15:33:25 2019

HPL_pdgesv() end time   Wed Dec 11 15:34:09 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.34900801e-03 ...... PASSED
================================================================================

Finished      4 tests with the following results:
              4 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

step 3 gather statistics with a performance metric (example shown here, results plots are not included)

A shell was created here below:

#!/bin/sh

#PBS -N hpl_fullscale
### request four node with 112 cores
#PBS -l nodes=4:ppn=28,walltime=00:30:00
#PBS -q coc-ice-multi

module load cse6230/core
module load mkl/19.0

cd "$HOME"/cse6230-hw/final/hpl/hpl-2.3/bin/Linux/
### run example with 112 cores
cp HPL_fullscale.dat HPL.dat
mpirun -np 112 ./xhpl | tee "$HOME"/cse6230-hw/final/fullscale.txt

To ensure the quality of statistics, we have run the benchmark at the same setting for five times. Below is an example of running HPL benchmark on four nodes.

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   19200    38400    57600    76800    96000   115200 
NB     :     192 
PMAP   : Row-major process mapping
P      :       8 
Q      :      14 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               2.220446e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       19200   192     8    14               2.67             1.7684e+03
HPL_pdgesv() start time Wed Dec 11 16:12:52 2019

HPL_pdgesv() end time   Wed Dec 11 16:12:55 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.64198128e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       38400   192     8    14              14.03             2.6908e+03
HPL_pdgesv() start time Wed Dec 11 16:12:55 2019

HPL_pdgesv() end time   Wed Dec 11 16:13:09 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.22453577e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       57600   192     8    14              41.69             3.0563e+03
HPL_pdgesv() start time Wed Dec 11 16:13:10 2019

HPL_pdgesv() end time   Wed Dec 11 16:13:52 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.34435654e-03 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       76800   192     8    14              94.65             3.1905e+03
HPL_pdgesv() start time Wed Dec 11 16:13:54 2019

HPL_pdgesv() end time   Wed Dec 11 16:15:28 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   9.74225766e-04 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       96000   192     8    14             177.22             3.3282e+03
HPL_pdgesv() start time Wed Dec 11 16:15:31 2019

HPL_pdgesv() end time   Wed Dec 11 16:18:28 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   8.39070264e-04 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4      115200   192     8    14             299.85             3.3992e+03
HPL_pdgesv() start time Wed Dec 11 16:18:32 2019

HPL_pdgesv() end time   Wed Dec 11 16:23:32 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.10942470e-03 ...... PASSED
================================================================================

Finished      6 tests with the following results:
              6 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

HPC

你好

Hello

안녕하세요 Bonjour

привет Hola

Namaste こんにちは Guten Tag