Mandelbrot benchmark

Processor benchmark. Measures pure floating point performance of the processor (x86 CPU, ARM CPU, Power CPU, NVidia GPU and AMD GPU).
Code by Branislav Jansik, IT4Innovations

Intro

The Mandelbrot benchmark measures pure floating point performance of the processor (x86 CPU, ARM CPU, Power CPU, NVidia GPU and AMD GPU). All calculations are performed in registers only, with no memory or disk access. Performance very close to the peak floating point rate is sustained, and the nominal peak may be exceeded. The code puts an extreme load on the processor (CPU or GPU), forcing operation at TDP. Thermal throttling may be observed.

The code is implemented in assembly language. Fused multiply add (FMA) vector instructions or warp matrix-multiply-add (WMMA) instructions are executed. Instruction sets supported: x86 (SSE), x86_64 (SSE, AVX, AVX2, AVX-512), ARM (AArch64 NEON, SVE), CUDA (PTX), AMD (RDNA), Power (ppc64/ppc64le VSX).

The code is optimized to fill the instruction pipeline completely; up to two instructions are retired every clock cycle on x86 architectures.
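As a rough sanity check on the two-instructions-per-cycle figure, a nominal peak in GFLIPS can be estimated as cores × clock × instructions per cycle. The sketch below is illustrative only: the function name is made up, and the core count of the i7-4790 (4 physical cores) is an assumption not stated in this document. The clock and the measured value come from the Benchmarks table.

```python
def nominal_peak_gflips(cores, clock_ghz, instr_per_cycle=2):
    """Estimate nominal peak in GFLIPS: cores * GHz * FP instructions per cycle."""
    return cores * clock_ghz * instr_per_cycle


# Assumption: the i7-4790 has 4 physical cores; 3.60 GHz is its nominal clock.
peak = nominal_peak_gflips(cores=4, clock_ghz=3.60)
measured = 28.3  # GFLIPS, taken from the Benchmarks table

print(f"nominal peak {peak:.1f} GFLIPS, measured {measured} GFLIPS "
      f"({100 * measured / peak:.0f}% of nominal)")
```

Turbo clocks can push the measured value above such a nominal estimate, which matches the remark that the nominal peak may be exceeded.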

The x86/AArch64/Power code is parallelized via MPI and OpenMP; the CUDA (PTX) and AMD (RDNA) code is parallelized via asynchronous kernel launch and asynchronous device access. Every MPI process, OpenMP thread or GPU device runs exactly the same workload.

Results are obtained in GFLIPS (Giga ($10^9$) Floating Point Instructions Per Second).
Convert to Gflop/s (Giga floating point operations per second) by applying a factor according to the instruction set and precision:

inst. set     precision  format  factor
SSE           single     FP32    4
SSE           double     FP64    2
AVX           double     FP64    4
AVX2 FMA      double     FP64    8
AVX-512 FMA   double     FP64    16
PTX FMA       half       FP16    2
PTX FMA       half2      FP16    4
PTX FMA       single     FP32    2
PTX FMA       double     FP64    2
RDNA FMA      single     FP32    2
RDNA FMA      double     FP64    2
CDNA MFMA     single     FP32    1280
PTX WMMA      half       FP16    8192
PTX WMMA      double     FP64    512
SVE-128 FMA   double     FP64    4
SVE-512 FMA   double     FP64    16
NEON FMLA     double     FP64    4
NEON          double     FP64    2
VSX FMA       double     FP64    4
VSX GER       double     FP64    16
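The conversion can be sketched as a table lookup. The vector rows of the table above are also consistent with counting lanes per register and doubling for FMA (one multiply and one add per lane); that reading is an inference from the numbers, not something stated by the author. The matrix rows (WMMA, MFMA, GER) are simply copied. All function and variable names below are illustrative.

```python
# Factors copied from the table above, keyed by (instruction set, precision).
FACTORS = {
    ("SSE", "single"): 4, ("SSE", "double"): 2, ("AVX", "double"): 4,
    ("AVX2 FMA", "double"): 8, ("AVX-512 FMA", "double"): 16,
    ("PTX FMA", "half"): 2, ("PTX FMA", "half2"): 4,
    ("PTX FMA", "single"): 2, ("PTX FMA", "double"): 2,
    ("RDNA FMA", "single"): 2, ("RDNA FMA", "double"): 2,
    ("CDNA MFMA", "single"): 1280,
    ("PTX WMMA", "half"): 8192, ("PTX WMMA", "double"): 512,
    ("SVE-128 FMA", "double"): 4, ("SVE-512 FMA", "double"): 16,
    ("NEON FMLA", "double"): 4, ("NEON", "double"): 2,
    ("VSX FMA", "double"): 4, ("VSX GER", "double"): 16,
}


def simd_factor(vector_bits, element_bits, fma):
    """Operations per vector instruction: lanes, doubled when the instruction is an FMA."""
    lanes = vector_bits // element_bits
    return lanes * (2 if fma else 1)


def gflips_to_gflops(gflips, inst_set, precision):
    """Convert a GFLIPS result to Gflop/s using the factor table."""
    return gflips * FACTORS[(inst_set, precision)]


# The lane-counting rule reproduces the vector rows of the table.
assert simd_factor(128, 32, fma=False) == FACTORS[("SSE", "single")]
assert simd_factor(256, 64, fma=True) == FACTORS[("AVX2 FMA", "double")]
assert simd_factor(512, 64, fma=True) == FACTORS[("AVX-512 FMA", "double")]

# Example: the i7-1068NG7 row of the Benchmarks table (12.3 GFLIPS, AVX-512 FMA, double).
print(gflips_to_gflops(12.3, "AVX-512 FMA", "double"))
```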

Theory

The benchmark runs the Mandelbrot iterations. Such iterations remain bounded indefinitely, which allows the benchmark to execute calculations on meaningful numbers (no Inf, NaN, etc.) for as long as it takes to accumulate the desired runtime.

The Mandelbrot iterations are defined as

$$z_{k+1} = z_k^2 + c$$

where $z_0 = 0$ and the constant $c$ is from the Mandelbrot set of complex numbers. For the NVidia GPU tensor benchmark, we define generalized Mandelbrot iterations as

$$Z_{k+1} = Z_k Z_k + C$$

where $Z_0$ is the zero matrix and the matrix $C$ has eigenvalues within the Mandelbrot set.
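The boundedness property can be illustrated in a few lines of plain Python. This is only a sketch of the mathematics, not of the assembly kernels. The values c = -1 and c = 1/4 (both inside the Mandelbrot set), the 2x2 matrix C with those eigenvalues, and the escape threshold of 2 applied to matrix entries are all choices made here for illustration.

```python
def scalar_orbit_max(c, steps=1000, escape_radius=2.0):
    """Iterate z <- z*z + c from z = 0; return the largest |z|, or None on escape."""
    z = 0j
    largest = 0.0
    for _ in range(steps):
        z = z * z + c
        largest = max(largest, abs(z))
        if largest > escape_radius:
            return None
    return largest


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def matrix_orbit_max(C, steps=1000, escape_radius=2.0):
    """Iterate Z <- Z Z + C from Z = 0; return the largest |entry|, or None on escape."""
    Z = [[0.0, 0.0], [0.0, 0.0]]
    largest = 0.0
    for _ in range(steps):
        ZZ = matmul2(Z, Z)
        Z = [[ZZ[i][j] + C[i][j] for j in range(2)] for i in range(2)]
        largest = max(largest, max(abs(x) for row in Z for x in row))
        if largest > escape_radius:
            return None
    return largest


print(scalar_orbit_max(-1))    # c = -1: the orbit alternates between -1 and 0
print(scalar_orbit_max(0.25))  # c = 1/4: the orbit creeps up towards 1/2
print(scalar_orbit_max(1))     # c = 1 lies outside the set, so the orbit escapes

# Upper-triangular C, so its eigenvalues are the diagonal entries -1 and 1/4.
C = [[-1.0, 1.0], [0.0, 0.25]]
print(matrix_orbit_max(C))
```

Because every iterate is a polynomial in C, the diagonal of Z follows the two scalar orbits, which is why eigenvalues inside the set keep the matrix iteration finite in this example.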

Source

Use the following source files according to instruction set and precision:

source inst. set precision arch. comment
mandelbrot-real-sse-mpi-dump.c SSE single x86_64 Nehalem. Uses SSE 64 bit MUL and ADD instructions. Also available in a 32-bit SSE variant for x86.
mandelbrot-real-avx-mpi-dump.c AVX double x86_64 Sandybridge. Runs MUL and ADD instructions.
mandelbrot-real-fma-mpi-mem4-dump.c AVX2 FMA double x86_64 Haswell. Runs FMA. 4 steps to cache for optimal pipelining.
mandelbrot-real-fma-mpi-dump-mic.c AVX-512 FMA double x86_64 Works for MIC, SKX and KNL architectures incl. Xeon Phi, Skylake. Runs FMA.
mandelbrot-real-fma-ptx-dump.cu PTX FMA double NVIDIA sm_53 Uses Nvidia PTX FMA instructions. Optimized for Nvidia K20 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices.
mandelbrot-real-wmma-ptx-f16-dump.cu PTX WMMA half NVIDIA sm_70 Uses PTX WMMA (Warp Matrix Multiply-Add) instructions, targeting NVIDIA V100 tensor cores. Optimized for Nvidia V100 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices.
mandelbrot-real-wmma-ptx-f64-dump.cu PTX WMMA double NVIDIA sm_80 Uses PTX WMMA (Warp Matrix Multiply-Add) instructions, targeting NVIDIA A100 tensor cores. Optimized for Nvidia A100 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices.
mandelbrot-real-fma-rdna-f64-dump.cpp RDNA FMA double AMD RDNA Uses RDNA FMA instructions (AMD Radeon GPUs). Optimized for AMD Radeon MI100. Set NBLOCKS and NTHREADS accordingly for other devices. Also available in FP32.
accumulator-mfma-cdna-f32.cpp CDNA MFMA single AMD CDNA Uses AMD Instinct MI100 ISA MFMA 16x16x4 matrix multiplication instructions. Optimized for AMD Radeon MI100. Set NBLOCKS and NTHREADS accordingly for other devices.
mandelbrot-real-fma-sve-f64-omp.c SVE FMA double AArch64 ARM AArch64. Supports vector-length agnostic SVE FMA instructions. The code automatically adapts to different SVE register vector lengths.
jansik-real-fmla-neon-f64-omp.c NEON FMLA double AArch64 ARM AArch64. Supports ARM NEON (128bit) FMLA instructions. The code runs Mandelbrot-inspired Jansik iterations to accommodate FMLA semantics.
mandelbrot-real-neon-f64-omp.c NEON double AArch64 ARM AArch64. Runs ARM NEON (128bit) FMUL and FADD instructions.
mandelbrot-real-fma-power-f64-omp.c VSX FMA double Power ppc64/ppc64le Mandelbrot variant for the OpenPOWER Power ISA architecture. Executes Power VSX (128bit) FMA instructions.
accumulator-ger-power-f64-omp.c VSX GER double Power ppc64/ppc64le Code executing Power VSX (128bit) rank-2 update GER instructions. These compute outer product between 4x1 and 2x1 double precision vectors and store results in dedicated accumulator registers.
cpuid.c x86 CPUID N/A x86 Runs CPUID instruction. Find out about x86 CPU capabilities.

Build

Make Build

Requires gcc or icc compiler, optionally MPI and CUDA.

$ make
Usage: make [icc] target [target [target] ... ]

For example

$ make icc omp mpi
$ make mandelbrot-real-fma-ptx-dump.x NBLOCKS=1296 NTHREADS=32 # Compile for A100

Manual Build

As the code is in assembly, the choice of compiler or optimization flags does not matter.
Choose the appropriate source file and compile:

$ mpicc filename-mpi-.c -o filename.x  
$ icc -qopenmp filename-omp-.c -o filename.x  
$ gcc -fopenmp filename-omp-.c -o filename.x -lm  
$ nvcc filename.cu -o filename.x
$ hipcc filename.cpp -o filename.x 

For WMMA build, see the source header.

Some older compilers may complain about the ymm or zmm registers in the clobbers list. It is safe to comment out this section of the code.

Windows binaries

For convenience, we provide some MinGW Windows compiled executables.
cpuid.exe
mandelbrot-real-sse-omp32.exe
mandelbrot-real-sse-omp.exe
mandelbrot-real-avx-omp.exe
mandelbrot-real-fma-omp.exe
mandelbrot-real-fma512-omp.exe

Run in Command Prompt or PowerShell. Windows Defender and other antivirus software
may flag the executables as a threat. This is a false positive.
sha256sum.txt

Run

Run as MPI, OpenMP or GPU executable:

$ mpirun -n number_of_cores ./filename.x [number_of_repetitions] 
$ OMP_PROC_BIND=true ./filename-omp.x [number_of_repetitions] 
$ ./filename.x [number_of_repetitions]

The OMP variant will use all available CPU cores, the GPU variants will use all available GPUs. Control the execution by setting OMP_NUM_THREADS, CUDA_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES environment variables:

$ export OMP_NUM_THREADS=4      # e.g. run on 4 cores only
$ export CUDA_VISIBLE_DEVICES=0 # e.g. run on device no. 0 only
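When runs are scripted, the same variables can be set per invocation instead of being exported in the shell. The helper below is a hypothetical sketch (the function name and its parameters are not part of the benchmark); it only builds the environment and relies on the comma-separated device list format that CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES accept.

```python
import os


def benchmark_env(omp_threads=None, cuda_devices=None, hip_devices=None, base=None):
    """Return a copy of the environment with the benchmark control variables set."""
    env = dict(os.environ if base is None else base)
    if omp_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_threads)
    if cuda_devices is not None:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in cuda_devices)
    if hip_devices is not None:
        env["HIP_VISIBLE_DEVICES"] = ",".join(str(d) for d in hip_devices)
    return env


env = benchmark_env(omp_threads=4, cuda_devices=[0], base={})
print(env)

# To launch a run with this environment (the executable name is a placeholder):
#   import subprocess
#   subprocess.run(["./filename-omp.x", "10"], env=benchmark_env(omp_threads=4), check=True)
```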

Benchmarks

Processor inst. set precision GFLIPS Gflop/s
Atom N280, 1.66GHz SSE single 1.6 6.4
i3-2370M, 2.40GHz AVX double 9.5 38.0
Broadcom BCM2712 ARM Cortex-A76, 4c, 2.4GHz (Raspberry Pi 5) NEON FMLA double 19.2 76.8
i5-3470, 3.20GHz AVX double 26.5 106.0
Samsung Exynos 2200, 8c SoC, 1.8-2.8GHz (Samsung Galaxy S22) SVE-128 FMA double 26.7 106.8
i7-10510U, 2.30GHz AVX2 FMA double 17.7 141.6
Apple M3, 8c, 4.05GHz NEON double 73.3 146.6
i5-6300HQ, 2.30GHz AVX2 FMA double 19.2 153.6
i7-10510U, 1.80GHz AVX2 FMA double 22.2 177.6
i7-1068NG7, 2.30GHz AVX-512 FMA double 12.3 196.8
Nvidia GTX 1080Ti, 1.58GHz PTX FMA half2 50.5 202.0
i7-4790, 3.60GHz AVX2 FMA double 28.3 226.4
Nvidia Tesla T4, 1.59GHz PTX FMA double 127.3 254.6
Apple M3, 8c, 4.05GHz NEON FMLA double 72.6 290.4
2x E5-2470, 2.30GHz AVX double 84.0 336.0
2x E5-2665, 2.40GHz AVX double 88.3 353.2
i5-9400F, 2.9GHz AVX2 FMA double 46.2 369.6
Nvidia GTX 1080Ti, 1.58GHz PTX FMA double 201.6 403.2
AMD Ryzen 7 PRO 7840U, 3.30GHz AVX2 FMA double 51.5 412.0
AMD Ryzen 7 PRO 7840U, 3.30GHz AVX-512 FMA double 26.4 422.4
Xeon D-1587, 1.70GHz AVX2 FMA double 54.2 433.6
Nvidia Quadro RTX 6000, 1.77GHz PTX FMA double 266.0 532.0
i5-11400F, 2.60GHz AVX-512 FMA double 35.5 568.0
2x ARM Cortex-A72, 32c, 2.0GHz NEON FMLA double 152.0 608.0
2x Opteron 6376, 2.3GHz AVX2 FMA double 76.6 612.8
2x EPYC 7351, 2.40GHz AVX double 181.3 725.2
2x EPYC 7351, 2.40GHz AVX2 FMA double 91.0 728.0
2x E5-2680v3, 2.50GHz AVX2 FMA double 125.7 1005.6
Nvidia K20, 0.71GHz PTX FMA double 586.2 1172.4
Xeon Phi 7120P, 1.24GHz AVX-512 FMA double 74.8 1196.8
2x EPYC 7513, 2.60GHz AVX double 308.9 1235.6
2x EPYC 7513, 2.60GHz AVX2 FMA double 157.1 1256.8
Nvidia MX330, 1.59GHz PTX FMA single 648.8 1297.6
Nvidia K20Xm, 0.73GHz PTX FMA double 654.1 1308.2
Nvidia K40c, 0.75GHz PTX FMA double 703.1 1406.2
Nvidia GTX 960M, 1.18GHz PTX FMA single 748.2 1496.4
2x Xeon Gold 6126, 2.60GHz AVX-512 FMA double 109.6 1753.6
Nvidia Quadro K5000, 0.71GHz PTX FMA single 910.1 1820.2
ARM Ampere Altra Q80-30, 3.00GHz NEON FMLA double 479.0 1916.0
2x Xeon Gold 6130, 2.10GHz AVX-512 FMA double 121.2 1939.2
2x Xeon Gold 6138, 2.00GHz AVX-512 FMA double 148.5 2376.0
Fujitsu ARM A64FX, 2.00GHz SVE-512 FMA double 160.0 2560.0
Xeon Phi 7210, 1.30GHz AVX-512 FMA double 160.9 2574.4
2x Power10 12-CORE 2.90 TO 4.0 GHz (IBM POWER S1022) VSX FMA double 655.0 2620.0
2x Xeon Gold 6240, 2.60GHz AVX-512 FMA double 172.9 2766.4
Nvidia K20, 0.71GHz PTX FMA single 1532.9 3065.8
Nvidia K20Xm, 0.73GHz PTX FMA single 1706.9 3413.8
Nvidia K40c, 0.75GHz PTX FMA single 1734.8 3469.6
2x Xeon 8168, 2.70GHz AVX-512 FMA double 236.1 3777.6
2x Xeon Gold 6338, 2.00GHz AVX-512 FMA double 256.1 4097.6
Nvidia Tesla P100-PCIE-12GB, 1.33GHz PTX FMA double 2380.0 4760.0
Nvidia A30, 1.44GHz PTX FMA double 2514 5028
2x EPYC 7763, 2.45GHz AVX2 FMA double 662.0 5296.0
2x EPYC 7H12, 2.60GHz AVX2 FMA double 664.7 5317.8
2x EPYC 7773X, 2.20GHz AVX2 FMA double 697.6 5580.8
2x Power10 12-CORE 2.90 TO 4.0 GHz (IBM POWER S1022) VSX GER double 360.5 5768.0
8x Xeon 8153, 2.00GHz (full X808) AVX-512 FMA double 392.0 6272.0
Nvidia Tesla T4, 1.59GHz PTX FMA single 3200 6400
Nvidia TITAN V, 1.46GHz PTX FMA double 3434.9 6869.8
Nvidia ARM Grace CPU Superchip, 3.1GHz SVE-128 FMA double 1840 7360
2x Xeon CPU Max 9468, 2.10GHz AVX-512 FMA double 475.0 7600.0
Nvidia V100-SXM3, 1.59GHz PTX FMA double 4088.1 8176.2
Nvidia Tesla P100-PCIE-12GB, 1.33GHz PTX FMA single 4737.2 9474.4
AMD Radeon MI100, 1.50GHz RDNA FMA double 4807.0 9614
Nvidia A100-SXM4-40GB, 1.41GHz PTX FMA double 4864.0 9728
Nvidia A30, 1.44GHz PTX WMMA double 19.0 9728
Nvidia A30, 1.44GHz PTX FMA single 5123 10246
Nvidia GTX 1080Ti, 1.58GHz PTX FMA single 6000.0 12000
Nvidia Tesla T4, 1.59GHz PTX FMA half2 3200 12800
Nvidia TITAN V, 1.46GHz PTX FMA single 6870.2 13740
Nvidia Quadro RTX 6000, 1.77GHz PTX FMA single 8100.0 16200
Nvidia V100-SXM3, 1.59GHz PTX FMA half 8174.0 16348
Nvidia V100-SXM3, 1.59GHz PTX FMA single 8175.0 16350
Nvidia RTX 3060 Ti, 1.66GHz PTX FMA single 8407.7 16815
Nvidia RTX 3060 Ti, 1.66GHz PTX FMA half2 4292.9 17172
Nvidia Tesla P100-PCIE-12GB, 1.33GHz PTX FMA half2 4737.1 18948
Nvidia A100-SXM4-40GB, 1.41GHz PTX FMA single 9731.0 19462
Nvidia A100-SXM4-40GB, 1.41GHz PTX WMMA double 38.02 19466
AMD Radeon MI100, 1.50GHz RDNA FMA single 10499 20998
Nvidia Tesla T4, 1.59GHz PTX WMMA half 3.20 26214
Nvidia TITAN V, 1.46GHz PTX FMA half2 6863.8 27455
AMD Radeon MI100, 1.50GHz CDNA MFMA single 21.95 28096
Nvidia Quadro RTX 6000, 1.77GHz PTX FMA half2 7800.0 31200
Nvidia V100-SXM3, 1.59GHz PTX FMA half2 8174.0 32696
Nvidia A30, 1.44GHz PTX FMA half2 8700 34800
Nvidia A100-SXM4-40GB, 1.41GHz PTX FMA half 22920 45840
32x Xeon 8268, 2.90GHz (full Karolina HPE Superdome Flex node) AVX-512 FMA double 3650 58400
Nvidia Quadro RTX 6000, 1.77GHz PTX WMMA half 7.79 63816
Nvidia A100-SXM4-40GB, 1.41GHz PTX FMA half2 17355 69420
Nvidia RTX 3060 Ti, 1.66GHz PTX WMMA half 8.90 72908
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) PTX FMA double 38944 77888
Nvidia TITAN V, 1.46GHz PTX WMMA half 13.40 109772
16x Nvidia V100-SXM3, 1.59GHz (full DGX-2) PTX FMA double 65200 130400
Nvidia V100-SXM3, 1.59GHz PTX WMMA half 15.95 130662
Nvidia A30, 1.44GHz PTX WMMA half 18.0 147456
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) PTX WMMA double 303.6 155443
Nvidia A100-SXM4-40GB, 1.41GHz PTX WMMA half 37.9 310477
16x Nvidia V100-SXM3, 1.59GHz (full DGX-2) PTX FMA half2 128584 514336
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) PTX FMA half2 138800 555200
16x Nvidia V100-SXM3, 1.59GHz (full DGX-2) PTX WMMA half 254.3 2083226
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) PTX WMMA half 303.1 2482995
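For anyone post-processing these results, a row can be split into fields by reading it from the right: the last two tokens are the GFLIPS and Gflop/s values, the token before them is the precision, and the instruction set is the longest known name preceding that. The parser below is a sketch with made-up names. It assumes one complete row per line and takes the list of instruction set names from the conversion table in the Intro.

```python
# Instruction set names as they appear in the conversion table.
INST_SETS = [
    "SSE", "AVX", "AVX2 FMA", "AVX-512 FMA", "PTX FMA", "PTX WMMA",
    "RDNA FMA", "CDNA MFMA", "SVE-128 FMA", "SVE-512 FMA",
    "NEON FMLA", "NEON", "VSX FMA", "VSX GER",
]


def parse_row(line):
    """Split one benchmark row into processor, inst_set, precision, gflips, gflops."""
    rest, precision, gflips, gflops = line.rsplit(None, 3)
    # Try longer names first so that "NEON FMLA" is not mistaken for "NEON".
    for name in sorted(INST_SETS, key=len, reverse=True):
        if rest.endswith(" " + name):
            return {
                "processor": rest[: -len(name)].rstrip(),
                "inst_set": name,
                "precision": precision,
                "gflips": float(gflips),
                "gflops": float(gflops),
            }
    raise ValueError(f"no known instruction set in row: {line!r}")


row = parse_row("i7-4790, 3.60GHz AVX2 FMA double 28.3 226.4")
print(row["processor"], "|", row["inst_set"], "|", row["precision"])
# The ratio of the two result columns recovers the conversion factor for the row.
print(round(row["gflops"] / row["gflips"], 3))
```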

License

Copyright (c) 2018, Branislav Jansik
All rights reserved.

Redistribution and use in source and binary forms, without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Modifications, code reuse and derivative works are allowed only upon written consent by the copyright holder.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.