Mandelbrot benchmark
Processor benchmark. Measures pure floating point performance of the processor (x86 CPU, ARM CPU, Power CPU, NVidia GPU and AMD GPU).
Code by Branislav Jansik, IT4Innovations
Intro
The Mandelbrot benchmark measures pure floating point performance of the processor (x86 CPU, ARM CPU, Power CPU, NVidia GPU and AMD GPU). All calculations run on registers only, with no memory or disk access. The benchmark sustains very close to peak floating point performance, and the nominal peak may be exceeded. The code puts extreme load on the processor (CPU or GPU), forcing operation at TDP. Thermal throttling may be observed.
The code is implemented in assembly language. Fused multiply add (FMA) vector instructions or warp matrix-multiply-add (WMMA) instructions are executed. Instruction sets supported: x86 (SSE), x86_64 (SSE, AVX, AVX2, AVX-512), ARM (AArch64 NEON, SVE), cuda (PTX), amd (RDNA), Power (ppc64/ppc64le VSX).
The code is optimized to fill the instruction pipeline: up to two instructions are retired every clock cycle on x86 architectures.
The x86/AArch64/Power code is parallelized via MPI and OpenMP, the cuda(PTX) and amd(RDNA) via asynchronous kernel launch and asynchronous device access. Every MPI process, OMP thread or GPU device runs exactly the same workload.
Results are obtained in GFLIPS (giga floating point instructions per second).
Convert to Gflop/s (Giga floating point operations per second) by applying a factor according to the instruction set and precision:
inst. set | precision | format | factor |
---|---|---|---|
SSE | single | FP32 | 4 |
SSE | double | FP64 | 2 |
AVX | double | FP64 | 4 |
AVX2 FMA | double | FP64 | 8 |
AVX-512 FMA | double | FP64 | 16 |
PTX FMA | half | FP16 | 2 |
PTX FMA | half2 | FP16 | 4 |
PTX FMA | single | FP32 | 2 |
PTX FMA | double | FP64 | 2 |
RDNA FMA | single | FP32 | 2 |
RDNA FMA | double | FP64 | 2 |
CDNA MFMA | single | FP32 | 1280 |
PTX WMMA | half | FP16 | 8192 |
PTX WMMA | double | FP64 | 512 |
SVE-128 FMA | double | FP64 | 4 |
SVE-512 FMA | double | FP64 | 16 |
NEON FMLA | double | FP64 | 4 |
NEON | double | FP64 | 2 |
VSX FMA | double | FP64 | 4 |
VSX GER | double | FP64 | 16 |
Theory
The benchmark runs the Mandelbrot iterations. Such iterations remain bounded indefinitely, which allows the benchmark to execute calculations on meaningful numbers (no Inf, NaN, etc.) for as long as it takes to accumulate the desired runtime.
The Mandelbrot iterations are defined as

$$z_{n+1} = z_n^2 + c$$
Source
Use the following source files according to instruction set and precision:
source | inst. set | precision | arch. | comment |
---|---|---|---|---|
mandelbrot-real-sse-mpi-dump.c | SSE | single | x86_64 | Nehalem. Uses SSE 64 bit MUL and ADD instructions. Also available in 32bit SSE variant for x86. |
mandelbrot-real-avx-mpi-dump.c | AVX | double | x86_64 | Sandybridge. Runs MUL and ADD instructions. |
mandelbrot-real-fma-mpi-mem4-dump.c | AVX2 FMA | double | x86_64 | Haswell. Runs FMA. 4 steps to cache for optimal pipelining. |
mandelbrot-real-fma-mpi-dump-mic.c | AVX-512 FMA | double | x86_64 | Works for MIC, SKX and KNL architectures incl. Xeon Phi, Skylake. Runs FMA. |
mandelbrot-real-fma-ptx-dump.cu | PTX FMA | double | NVIDIA sm_53 | Uses Nvidia PTX FMA instructions. Optimized for Nvidia K20 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices. |
mandelbrot-real-wmma-ptx-f16-dump.cu | PTX WMMA | half | NVIDIA sm_70 | Uses PTX WMMA (Warp Matrix-matrix Multiply Add) instructions, targeting NVIDIA V100 tensor cores. Optimized for Nvidia V100 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices. |
mandelbrot-real-wmma-ptx-f64-dump.cu | PTX WMMA | double | NVIDIA sm_80 | Uses PTX WMMA (Warp Matrix-matrix Multiply Add) instructions, targeting NVIDIA A100 tensor cores. Optimized for Nvidia A100 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices. |
mandelbrot-real-fma-rdna-f64-dump.cpp | RDNA FMA | double | AMD RDNA | Uses RDNA FMA instructions (AMD Radeon GPUs). Optimized for AMD Radeon MI100. Set NBLOCKS and NTHREADS accordingly for other devices. Also available in FP32. |
accumulator-mfma-cdna-f32.cpp | CDNA MFMA | single | AMD CDNA | Uses AMD Instinct MI100 ISA MFMA 16x16x4 matrix multiplication instructions. Optimized for AMD Radeon MI100. Set NBLOCKS and NTHREADS accordingly for other devices. |
mandelbrot-real-fma-sve-f64-omp.c | SVE FMA | double | AArch64 | ARM AArch64. Supports vector-length agnostic SVE FMA instructions. The code automatically adapts to different SVE register vector lengths. |
jansik-real-fmla-neon-f64-omp.c | NEON FMLA | double | AArch64 | ARM AArch64. Supports ARM NEON (128bit) FMLA instructions. The code runs Mandelbrot inspired Jansik iterations to accommodate FMLA semantics. |
mandelbrot-real-neon-f64-omp.c | NEON | double | AArch64 | ARM AArch64. Runs ARM NEON (128bit) FMUL and FADD instructions. |
mandelbrot-real-fma-power-f64-omp.c | VSX FMA | double | Power ppc64/ppc64le | Mandelbrot variant for the OpenPOWER Power ISA architecture. Executes Power VSX (128bit) FMA instructions. |
accumulator-ger-power-f64-omp.c | VSX GER | double | Power ppc64/ppc64le | Code executing Power VSX (128bit) rank-2 update GER instructions. These compute outer product between 4x1 and 2x1 double precision vectors and store results in dedicated accumulator registers. |
cpuid.c | x86 CPUID | N/A | x86 | Runs CPUID instruction. Find out about x86 CPU capabilities. |
Build
Make Build
Requires gcc or icc compiler, optionally MPI and CUDA.
$ make
Usage: make [icc] target [target [target] ... ]
For example
$ make icc omp mpi
$ make mandelbrot-real-fma-ptx-dump.x NBLOCKS=1296 NTHREADS=32 # Compile for A100
Manual Build
As the code is in assembly, the choice of compiler or optimization flags does not matter.
Choose the appropriate source file and compile:
$ mpicc filename-mpi-.c -o filename.x
$ icc -qopenmp filename-omp-.c -o filename.x
$ gcc -fopenmp filename-omp-.c -o filename.x -lm
$ nvcc filename.cu -o filename.x
$ hipcc filename.cpp -o filename.x
For WMMA build, see the source header.
Some older compilers may complain about the ymm or zmm registers in the clobbers list. It is safe to comment out this section of the code.
Windows binaries
For convenience, we provide some MinGW Windows compiled executables.
cpuid.exe
mandelbrot-real-sse-omp32.exe
mandelbrot-real-sse-omp.exe
mandelbrot-real-avx-omp.exe
mandelbrot-real-fma-omp.exe
mandelbrot-real-fma512-omp.exe
Run in Command Prompt or PowerShell. Windows Defender and other antivirus software may flag the executables as a threat. This is a false positive.
sha256sum.txt
Run
Run as MPI, OpenMP or GPU executable:
$ mpirun -n number_of_cores ./filename.x [number_of_repetitions]
$ OMP_PROC_BIND=true ./filename-omp.x [number_of_repetitions]
$ ./filename.x [number_of_repetitions]
The OMP variant will use all available CPU cores, the GPU variants will use all available GPUs. Control the execution by setting OMP_NUM_THREADS, CUDA_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES environment variables:
$ export OMP_NUM_THREADS=4 # e.g. run on 4 cores only
$ export CUDA_VISIBLE_DEVICES=0 # e.g. run on device no. 0 only
Benchmarks
Processor | inst. set | precision | GFLIPS | Gflop/s |
---|---|---|---|---|
Atom N280, 1.66GHz | SSE | single | 1.6 | 6.4 |
i3-2370M, 2.40GHz | AVX | double | 9.5 | 38.0 |
Broadcom BCM2712 ARM Cortex-A76, 4c, 2.4GHz (Raspberry Pi 5) | NEON FMLA | double | 19.2 | 76.8 |
i5-3470, 3.20GHz | AVX | double | 26.5 | 106.0 |
Samsung Exynos 2200, 8c SoC, 1.8-2.8GHz (Samsung Galaxy S22) | SVE-128 FMA | double | 26.7 | 106.8 |
i7-10510U, 2.30GHz | AVX2 FMA | double | 17.7 | 141.6 |
Apple M3, 8c, 4.05GHz | NEON | double | 73.3 | 146.6 |
i5-6300HQ, 2.30GHz | AVX2 FMA | double | 19.2 | 153.6 |
i7-10510U, 1.80GHz | AVX2 FMA | double | 22.2 | 177.6 |
i7-1068NG7, 2.30GHz | AVX-512 FMA | double | 12.3 | 196.8 |
Nvidia GTX 1080Ti, 1.58GHz | PTX FMA | half2 | 50.5 | 202.0 |
i7-4790, 3.60GHz | AVX2 FMA | double | 28.3 | 226.4 |
Nvidia Tesla T4, 1.59GHz | PTX FMA | double | 127.3 | 254.6 |
Apple M3, 8c, 4.05GHz | NEON FMLA | double | 72.6 | 290.4 |
2x E5-2470, 2.30GHz | AVX | double | 84.0 | 336.0 |
2x E5-2665, 2.40GHz | AVX | double | 88.3 | 353.2 |
i5-9400F, 2.9GHz | AVX2 FMA | double | 46.2 | 369.6 |
Nvidia GTX 1080Ti, 1.58GHz | PTX FMA | double | 201.6 | 403.2 |
AMD Ryzen 7 PRO 7840U, 3.30GHz | AVX2 FMA | double | 51.5 | 412.0 |
AMD Ryzen 7 PRO 7840U, 3.30GHz | AVX-512 FMA | double | 26.4 | 422.4 |
Xeon D-1587, 1.70GHz | AVX2 FMA | double | 54.2 | 433.6 |
Nvidia Quadro RTX 6000, 1.77GHz | PTX FMA | double | 266.0 | 532.0 |
i5-11400F, 2.60GHz | AVX-512 FMA | double | 35.5 | 568.0 |
2x ARM Cortex-A72, 32c, 2.0GHz | NEON FMLA | double | 152.0 | 608.0 |
2x Opteron 6376, 2.3GHz | AVX2 FMA | double | 76.6 | 612.8 |
2x EPYC 7351, 2.40GHz | AVX | double | 181.3 | 725.2 |
2x EPYC 7351, 2.40GHz | AVX2 FMA | double | 91.0 | 728.0 |
2x E5-2680v3, 2.50GHz | AVX2 FMA | double | 125.7 | 1005.6 |
Nvidia K20, 0.71GHz | PTX FMA | double | 586.2 | 1172.4 |
Xeon Phi 7120P, 1.24GHz | AVX-512 FMA | double | 74.8 | 1196.8 |
2x EPYC 7513, 2.60GHz | AVX | double | 308.9 | 1235.6 |
2x EPYC 7513, 2.60GHz | AVX2 FMA | double | 157.1 | 1256.8 |
Nvidia MX330, 1.59GHz | PTX FMA | single | 648.8 | 1297.6 |
Nvidia K20Xm, 0.73GHz | PTX FMA | double | 654.1 | 1308.2 |
Nvidia K40c, 0.75GHz | PTX FMA | double | 703.1 | 1406.2 |
Nvidia GTX 960M, 1.18GHz | PTX FMA | single | 748.2 | 1496.4 |
2x Xeon Gold 6126, 2.60GHz | AVX-512 FMA | double | 109.6 | 1753.6 |
Nvidia Quadro K5000, 0.71GHz | PTX FMA | single | 910.1 | 1820.2 |
ARM Ampere Altra Q80-30, 3.00GHz | NEON FMLA | double | 479.0 | 1916.0 |
2x Xeon Gold 6130, 2.10GHz | AVX-512 FMA | double | 121.2 | 1939.2 |
2x Xeon Gold 6138, 2.00GHz | AVX-512 FMA | double | 148.5 | 2376.0 |
Fujitsu ARM A64FX, 2.00GHz | SVE-512 FMA | double | 160.0 | 2560.0 |
Xeon Phi 7210, 1.30GHz | AVX-512 FMA | double | 160.9 | 2574.4 |
2x Power10 12-CORE 2.90 TO 4.0 GHz (IBM POWER S1022) | VSX FMA | double | 655.0 | 2620.0 |
2x Xeon Gold 6240, 2.60GHz | AVX-512 FMA | double | 172.9 | 2766.4 |
Nvidia K20, 0.71GHz | PTX FMA | single | 1532.9 | 3065.8 |
Nvidia K20Xm, 0.73GHz | PTX FMA | single | 1706.9 | 3413.8 |
Nvidia K40c, 0.75GHz | PTX FMA | single | 1734.8 | 3469.6 |
2x Xeon 8168, 2.70GHz | AVX-512 FMA | double | 236.1 | 3777.6 |
2x Xeon Gold 6338, 2.00GHz | AVX-512 FMA | double | 256.1 | 4097.6 |
Nvidia Tesla P100-PCIE-12GB, 1.33GHz | PTX FMA | double | 2380.0 | 4760.0 |
Nvidia A30, 1.44GHz | PTX FMA | double | 2514 | 5028 |
2x EPYC 7763, 2.45GHz | AVX2 FMA | double | 662.0 | 5296.0 |
2x EPYC 7H12, 2.60GHz | AVX2 FMA | double | 664.7 | 5317.8 |
2x EPYC 7773X, 2.20GHz | AVX2 FMA | double | 697.6 | 5580.8 |
2x Power10 12-CORE 2.90 TO 4.0 GHz (IBM POWER S1022) | VSX GER | double | 360.5 | 5768.0 |
8x Xeon 8153, 2.00GHz (full X808) | AVX-512 FMA | double | 392.0 | 6272.0 |
Nvidia Tesla T4, 1.59GHz | PTX FMA | single | 3200 | 6400 |
Nvidia TITAN V, 1.46GHz | PTX FMA | double | 3434.9 | 6869.8 |
Nvidia ARM Grace CPU Superchip, 3.1GHz | SVE-128 FMA | double | 1840 | 7360 |
2x Xeon CPU Max 9468, 2.10GHz | AVX-512 FMA | double | 475.0 | 7600.0 |
Nvidia V100-SXM3, 1.59GHz | PTX FMA | double | 4088.1 | 8176.2 |
Nvidia Tesla P100-PCIE-12GB, 1.33GHz | PTX FMA | single | 4737.2 | 9474.4 |
AMD Radeon MI100, 1.50GHz | RDNA FMA | double | 4807.0 | 9614 |
Nvidia A100-SXM4-40GB, 1.41GHz | PTX FMA | double | 4864.0 | 9728 |
Nvidia A30, 1.44GHz | PTX WMMA | double | 19.0 | 9728 |
Nvidia A30, 1.44GHz | PTX FMA | single | 5123 | 10246 |
Nvidia GTX 1080Ti, 1.58GHz | PTX FMA | single | 6000.0 | 12000 |
Nvidia Tesla T4, 1.59GHz | PTX FMA | half2 | 3200 | 12800 |
Nvidia TITAN V, 1.46GHz | PTX FMA | single | 6870.2 | 13740 |
Nvidia Quadro RTX 6000, 1.77GHz | PTX FMA | single | 8100.0 | 16200 |
Nvidia V100-SXM3, 1.59GHz | PTX FMA | half | 8174.0 | 16348 |
Nvidia V100-SXM3, 1.59GHz | PTX FMA | single | 8175.0 | 16350 |
Nvidia RTX 3060 Ti, 1.66GHz | PTX FMA | single | 8407.7 | 16815 |
Nvidia RTX 3060 Ti, 1.66GHz | PTX FMA | half2 | 4292.9 | 17172 |
Nvidia Tesla P100-PCIE-12GB, 1.33GHz | PTX FMA | half2 | 4737.1 | 18948 |
Nvidia A100-SXM4-40GB, 1.41GHz | PTX FMA | single | 9731.0 | 19462 |
Nvidia A100-SXM4-40GB, 1.41GHz | PTX WMMA | double | 38.02 | 19466 |
AMD Radeon MI100, 1.50GHz | RDNA FMA | single | 10499 | 20998 |
Nvidia Tesla T4, 1.59GHz | PTX WMMA | half | 3.20 | 26214 |
Nvidia TITAN V, 1.46GHz | PTX FMA | half2 | 6863.8 | 27455 |
AMD Radeon MI100, 1.50GHz | CDNA MFMA | single | 21.95 | 28096 |
Nvidia Quadro RTX 6000, 1.77GHz | PTX FMA | half2 | 7800.0 | 31200 |
Nvidia V100-SXM3, 1.59GHz | PTX FMA | half2 | 8174.0 | 32696 |
Nvidia A30, 1.44GHz | PTX FMA | half2 | 8700 | 34800 |
Nvidia A100-SXM4-40GB, 1.41GHz | PTX FMA | half | 22920 | 45840 |
32x Xeon 8268, 2.90GHz (full Karolina HPE Superdome Flex node) | AVX-512 FMA | double | 3650 | 58400 |
Nvidia Quadro RTX 6000, 1.77GHz | PTX WMMA | half | 7.79 | 63816 |
Nvidia A100-SXM4-40GB, 1.41GHz | PTX FMA | half2 | 17355 | 69420 |
Nvidia RTX 3060 Ti, 1.66GHz | PTX WMMA | half | 8.90 | 72908 |
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) | PTX FMA | double | 38944 | 77888 |
Nvidia TITAN V, 1.46GHz | PTX WMMA | half | 13.40 | 109772 |
16x Nvidia V100-SXM3, 1.59GHz (full DGX-2) | PTX FMA | double | 65200 | 130400 |
Nvidia V100-SXM3, 1.59GHz | PTX WMMA | half | 15.95 | 130662 |
Nvidia A30, 1.44GHz | PTX WMMA | half | 18.0 | 147456 |
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) | PTX WMMA | double | 303.6 | 155443 |
Nvidia A100-SXM4-40GB, 1.41GHz | PTX WMMA | half | 37.9 | 310477 |
16x Nvidia V100-SXM3, 1.59GHz (full DGX-2) | PTX FMA | half2 | 128584 | 514336 |
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) | PTX FMA | half2 | 138800 | 555200 |
16x Nvidia V100-SXM3, 1.59GHz (full DGX-2) | PTX WMMA | half | 254.3 | 2083226 |
8x Nvidia A100-SXM4-40GB, 1.41GHz (full Karolina supercomputer acn node) | PTX WMMA | half | 303.1 | 2482995 |
License
Copyright (c) 2018, Branislav Jansik
All rights reserved.
Redistribution and use in source and binary forms, without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Modifications, code reuse and derivative works are allowed only upon written consent by the copyright holder.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.