Lorenz benchmark

Code for testing compiler and compiler-flag effectiveness.
Code for benchmarking BLAS library performance.

Intro

Compiler performance benchmark

The lorenz-c benchmark compares the compiler output to an optimal reference compilation.

Selecting an appropriate compiler and compiler flags to ensure optimal compilation to assembly (and in turn to machine code) can be a challenging task. This benchmark code is intended to help: it is constructed in such a way that the optimal assembly output can be derived manually. The optimal output is therefore known beforehand and is provided as reference. Being able to reproduce the optimal output with your own choice of compiler and flags gives a good starting point for efficient compilation of more complex codes.

The code implements Euler integration of the Lorenz system of ordinary differential equations. The equations relate the properties of a two-dimensional fluid layer uniformly warmed from below and cooled from above. The system exhibits a so-called strange attractor and is thus locally unstable yet globally stable: once trajectories have entered the attractor, nearby points diverge from one another but never leave the attractor. For this reason, the iterated values remain bounded indefinitely, making the system suitable for benchmarking. A further advantage is that the system is very simple and vectorizes easily when multiple starting points are integrated at once.

The benchmark runs on a single core.
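
For orientation, the integration amounts to a loop of the following form. This is only a minimal sketch using the classic parameter values; the actual lorenz-c.c may be organized differently, for example integrating many starting points at once to enable vectorization.

/* Minimal sketch of Euler integration of the Lorenz system.
 * Parameters, step size and iteration count are illustrative. */
#include <stdio.h>

int main(void)
{
    const double sigma = 10.0, rho = 28.0, beta = 8.0 / 3.0;
    const double dt = 1.0e-3;           /* time step */
    double x = 1.0, y = 1.0, z = 1.0;   /* starting point */

    for (long i = 0; i < 100000000L; i++) {
        const double dx = sigma * (y - x);
        const double dy = x * (rho - z) - y;
        const double dz = x * y - beta * z;
        x += dt * dx;
        y += dt * dy;
        z += dt * dz;
    }
    printf("%g %g %g\n", x, y, z);      /* keep the result live */
    return 0;
}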

BLAS library performance benchmark

The lorenz-blas benchmark provides a simple code that runs the BLAS DGEMM kernel over and over indefinitely.

The BLAS library kernels, in particular the BLAS level 3 matrix multiplication kernel DGEMM, often account for the majority of the compute time spent in scientific applications. The BLAS library is available in many implementations from many vendors, for example Intel MKL, ATLAS, IBM ESSL, Cray LibSci, AMD BLIS, OpenBLAS and the Netlib reference BLAS (libblas), to name a few. Proper configuration and use of the BLAS library can have a great impact on application performance.

The code implements a generalized Euler integration, where the x, y, z quantities are replaced by square matrices initialized to random values. Due to the presence of the strange attractor, the iterated matrices remain bounded indefinitely, making the code suitable for benchmarking. The matrix size and the number of iterations may be chosen freely to reach reasonable runtimes.
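
Conceptually, one iteration then contains two matrix multiplications (for the x*z and x*y terms) plus element-wise updates. The sketch below illustrates this using the CBLAS interface; the function name, memory layout and update order are assumptions, and the actual lorenz-blas.c may differ.

/* Sketch of one generalized Lorenz/Euler step with n x n matrices X, Y, Z.
 * Assumes a CBLAS interface (cblas.h); XZ and XY are caller-provided
 * n*n scratch buffers. Illustrative only. */
#include <cblas.h>

static void lorenz_step(int n, double dt,
                        double *X, double *Y, double *Z,
                        double *XZ, double *XY)
{
    const double sigma = 10.0, rho = 28.0, beta = 8.0 / 3.0;

    /* The two DGEMMs per iteration: XZ = X*Z and XY = X*Y */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, X, n, Z, n, 0.0, XZ, n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, X, n, Y, n, 0.0, XY, n);

    /* Element-wise Euler update, mirroring the scalar equations */
    for (long i = 0; i < (long)n * n; i++) {
        const double dx = sigma * (Y[i] - X[i]);
        const double dy = rho * X[i] - XZ[i] - Y[i];
        const double dz = XY[i] - beta * Z[i];
        X[i] += dt * dx;
        Y[i] += dt * dy;
        Z[i] += dt * dz;
    }
}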

To fully utilize the CPU, the BLAS library should run multi-threaded.
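
For OpenMP-threaded BLAS builds the thread count is typically controlled through the OMP_NUM_THREADS environment variable (Intel MKL also honours MKL_NUM_THREADS); the thread count and arguments below are just an example:

$ OMP_NUM_THREADS=16 ./lorenz-blas.x 5000 10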

Source

lorenz-c.c: C language compiler benchmark
lorenz-avx.s: x86_64 AVX architecture reference assembly
lorenz-avx2.s: x86_64 AVX2 architecture reference assembly
lorenz-avx512.s: x86_64 AVX-512 architecture reference assembly
cpuid.c: utility to determine the processor architecture

lorenz-blas.c: BLAS library benchmark

Build

Make Build

Default build:

$ make

Build the lorenz-c benchmark with your own compiler and flags:

$ make CC=mycc CFLAGS='-my -flags'

Build the lorenz-blas benchmark with your own BLAS library:

$ make lorenz-blas.x LIBS='-my -blas'
# e.g.
$ make lorenz-blas.x CC=icc CFLAGS='-O3 -xCORE-AVX2' LIBS='-mkl'

Manual Build

  1. Find out the processor architecture:
$ gcc cpuid.c -o cpuid.x
$ ./cpuid.x
  2. Select the appropriate source and build the reference code, for example:
$ gcc lorenz-avx2.s -o lorenz-avx2.x
  3. Build the lorenz-c benchmark, using the compiler and flags of your choice:
$ cc -S $CFLAGS lorenz-c.c
$ cc lorenz-c.s -o lorenz-c.x
  4. Build the lorenz-blas benchmark, using the BLAS library of your choice:
$ cc -fopenmp $CFLAGS lorenz-blas.c -o lorenz-blas.x $LIBS
# e.g.
$ icc -fopenmp -xCORE-AVX2 lorenz-blas.c -o lorenz-blas.x -mkl
$ gcc -fopenmp -mavx2 -O3 lorenz-blas.c -o lorenz-blas.x -lblas

Run

Compare the generated lorenz-c.s output with the reference assembly. In the optimal case, you should see an identical instruction pattern.
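
Labels, scheduling and register allocation differ between compilers, so a plain diff of the assembly is rarely clean; counting the packed-double vector mnemonics (which end in pd) gives a quick overview of the instruction mix, for example:

$ grep -oE 'v[a-z0-9]+pd' lorenz-c.s | sort | uniq -c
$ grep -oE 'v[a-z0-9]+pd' lorenz-avx2.s | sort | uniq -c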

Execute the benchmark and compare the timing to the reference. In the optimal case, the timings should be very close. For example:

$ time ./lorenz-c.x
real	0m5.659s
$ time ./lorenz-avx2.x
real	0m5.654s

Run lorenz-blas with a chosen matrix size and number of iterations, and compare the timings obtained with different BLAS libraries. In the optimal case, the timings should be very close.

For example, with matrix size 5000x5000 and 10 Lorenz iterations (20 matrix multiplications in total):

$ time ./lorenz-blas.x 5000 10
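
As a rough estimate, with the customary 2*N^3 flop count per matrix multiplication this run performs about 20 * 2 * 5000^3 ≈ 5e12 floating-point operations, so a BLAS library sustaining around 1 Tflop/s would need on the order of 5 seconds.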

Benchmarks

lorenz-c compiler benchmark

Processor                  Instruction set   Time [s]
Xeon Gold 6240, 2.60 GHz   AVX               7.10
Xeon Gold 6240, 2.60 GHz   AVX2              4.44
Xeon Gold 6240, 2.60 GHz   AVX-512           4.38

lorenz-blas BLAS benchmark

Matrix size 12000x12000, 10 iterations

Processor                     BLAS                    Time [s]   Gflop/s
2x Xeon Gold 6240, 2.60 GHz   Intel MKL 2020.4.304    34.2       2068.4