Lorenz benchmark
Code for testing the effectiveness of compilers and compiler flags.
Code for benchmarking BLAS library performance.
Intro
Compiler performance benchmark
The lorenz-c benchmark compares compiler output to an optimal reference compilation.
Selecting an appropriate compiler and compiler flags to ensure optimal compilation to assembly (and in turn to machine code) can be a challenging task. This benchmark code is intended to help: it is constructed in such a way that the optimal assembly output can be derived manually. The optimal output is therefore known beforehand and is provided as a reference. Being able to reproduce the optimal output with your own choice of compiler and flags gives a good start towards efficient compilation of more complex codes.
The code implements Euler integration of the Lorenz system of ordinary differential equations. The equations relate the properties of a two-dimensional fluid layer uniformly warmed from below and cooled from above. The system exhibits a so-called strange attractor and is thus locally unstable yet globally stable: once some sequences have entered the attractor, nearby points diverge from one another but never depart from the attractor. For this reason the system remains bounded indefinitely, making it suitable for benchmarking. A further advantage is that the system is very simple and easily yields to vectorization when multiple starting points are integrated.
The benchmark runs on a single core.
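The integration itself is a plain explicit Euler step of the Lorenz equations. As a rough single-trajectory illustration only (the parameter values, step size and iteration count below are placeholders, and the actual lorenz-c.c may differ, e.g. by integrating several trajectories to allow vectorization):

#include <stdio.h>

int main(void)
{
    const double sigma = 10.0, rho = 28.0, beta = 8.0 / 3.0; /* classic Lorenz parameters */
    const double dt = 1e-3;                                  /* Euler step size */
    double x = 1.0, y = 1.0, z = 1.0;

    for (long i = 0; i < 100000000L; i++) {
        /* right-hand side of the Lorenz system at the current point */
        double dx = sigma * (y - x);
        double dy = x * (rho - z) - y;
        double dz = x * y - beta * z;

        /* explicit Euler update; the strange attractor keeps the values bounded */
        x += dt * dx;
        y += dt * dy;
        z += dt * dz;
    }

    printf("%f %f %f\n", x, y, z); /* print so the loop is not optimized away */
    return 0;
}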
BLAS library performance benchmark
The lorenz-blas benchmark provides a simple code capable of running the BLAS DGEMM kernel over and over indefinitely.
The BLAS library kernels, in particular the BLAS level-3 matrix multiplication kernel DGEMM, often account for the majority of compute time spent in scientific applications. The BLAS library is available in many implementations from many vendors, for example Intel MKL, ATLAS, IBM ESSL, Cray LibSci, AMD BLIS, OpenBLAS and the Netlib reference libblas, to name a few. Proper configuration and use of the BLAS library may have a great impact on application performance.
The code implements a generalized Euler integration, where the x, y, z quantities are replaced by square matrices initialized to random values. Due to the presence of the strange attractor, the iterations over matrices remain bounded indefinitely, making the code suitable for benchmarking. The matrix size and the number of iterations may be chosen freely to reach reasonable runtimes.
To fully utilize the CPU, the BLAS library should run multi-threaded.
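In the matrix form, the product terms x·z and x·y of the Lorenz equations become the two DGEMM calls per iteration. A minimal sketch of one such iteration, assuming a CBLAS interface (the function name, argument layout and scratch buffers are illustrative; the actual lorenz-blas.c may organize the updates differently):

#include <cblas.h>

/* One generalized Lorenz/Euler step over n-by-n row-major matrices.
 * XZ and XY are caller-provided n*n scratch buffers. */
static void lorenz_step(int n, double dt, double sigma, double rho, double beta,
                        double *X, double *Y, double *Z, double *XZ, double *XY)
{
    /* the two DGEMM calls per iteration: XZ = X*Z and XY = X*Y */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, X, n, Z, n, 0.0, XZ, n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, X, n, Y, n, 0.0, XY, n);

    /* element-wise Euler updates, the matrix analogue of the scalar system */
    for (long i = 0; i < (long)n * n; i++) {
        double dx = sigma * (Y[i] - X[i]);
        double dy = rho * X[i] - XZ[i] - Y[i];
        double dz = XY[i] - beta * Z[i];
        X[i] += dt * dx;
        Y[i] += dt * dy;
        Z[i] += dt * dz;
    }
}

Each iteration thus contains two matrix multiplications, which is why the run example further below with 10 iterations corresponds to 20 DGEMM calls.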
Source
lorenz-c.c : C language compiler benchmark
lorenz-avx.s : x86_64 AVX architecture reference assembly
lorenz-avx2.s : x86_64 AVX2 architecture reference assembly
lorenz-avx512.s : x86_64 AVX-512 architecture reference assembly
cpuid.c : find out the processor architecture
lorenz-blas.c : BLAS library benchmark
Build
Make Build
Default build:
$ make
Build the lorenz-c benchmark with your own compiler and flags:
$ make CC=mycc CFLAGS='-my -flags'
Build the lorenz-blas benchmark with your own BLAS library:
$ make lorenz-blas.x LIBS='-my -blas'
# e.g.
$ make lorenz-blas.x CC=icc CFLAGS='-O3 -xCORE-AVX2' LIBS='-mkl'
Manual Build
- Find out the processor architecture:
$ gcc cpuid.c -o cpuid.x
$ ./cpuid.x
- Select the appropriate source and build the reference code, for example:
$ gcc lorenz-avx2.s -o lorenz-avx2.x
- Build the lorenz-c benchmark, using the compiler and flags of choice:
$ cc -S $CFLAGS lorenz-c.c
$ cc lorenz-c.s -o lorenz-c.x
- Build the lorenz-blas benchmark, using the BLAS library of choice:
$ cc -fopenmp $CFLAGS lorenz-blas.c -o lorenz-blas.x $LIBS
# e.g.
$ icc -fopenmp -xCORE-AVX2 lorenz-blas.c -o lorenz-blas.x -mkl
$ gcc -fopenmp -mavx2 -O3 lorenz-blas.c -o lorenz-blas.x -lblas
Run
Compare the lorenz-c.s output with the reference. In the optimal case, you should see an identical instruction pattern.
Execute the benchmark and compare the timing to the reference. In the optimal case, the timings should be very close. For example:
$ time ./lorenz-c.x
real 0m5.659s
$ time ./lorenz-avx2.x
real 0m5.654s
Run lorenz-blas, selecting the matrix size and the number of iterations, and compare the timings across the different BLAS libraries. In the optimal case, the timings should be very close.
For example, matrix size 5000x5000 and 10 Lorenz iterations (20 matrix multiplications in total):
$ time ./lorenz-blas.x 5000 10
Benchmarks
lorenz-c compiler benchmark
Processor | inst. set | time [s] |
---|---|---|
Xeon Gold 6240, 2.60GHz | AVX | 7.10 |
Xeon Gold 6240, 2.60GHz | AVX2 | 4.44 |
Xeon Gold 6240, 2.60GHz | AVX512 | 4.38 |
lorenz-blas BLAS benchmark
Matrix size 12000x12000, 10 iterations
Processor | BLAS | time [s] | Gflop/s |
---|---|---|---|
2x Xeon Gold 6240, 2.60GHz | Intel MKL 2020.4.304 | 34.2 | 2068.4 |