Commit 1c28f8df authored by Branislav Jansik

Merge branch 'master' of https://code.it4i.cz/jansik/mandelbrot

parents 810721ea edb0e351

Code by Branislav Jansik, IT4Innovations
## Intro
The mandelbrot benchmark measures pure floating point performance of the processor (x86 CPU or NVidia GPU). It runs
the Mandelbrot iterations $`z_{k+1}=z^2_k + c`$ where $`z_0 = 0`$ and the constant $`c`$ is from the Mandelbrot set of complex numbers. For simplicity and efficiency, we often select only the numbers on the real axis, as denoted by -real- in the source code filename. In the case of NVidia GPU tensor operations, it runs the Mandelbrot matrix iterations $`Z_{k+1}=Z_kZ_k + C`$ where $`Z_0`$ is the zero matrix and the $`C`$ matrix has eigenvalues within the Mandelbrot set. Since $`Z_0 = 0`$, every iterate is a polynomial in $`C`$, so its eigenvalues follow the scalar iteration and the matrix iterations remain bounded indefinitely.
The x86 code uses vector registers and high-energy fused multiply-add (FMA) vector instructions whenever the architecture allows it. The NVidia PTX code
uses fused multiply-add (FMA) instructions or warp matrix-multiply-add (WMMA) instructions.
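
To make the flop counting concrete, below is a minimal scalar sketch of the real-axis iteration. It is an illustration only, not one of the repository's kernels (those use SSE/AVX/AVX-512 intrinsics or PTX); the values of `c` and `niter` are placeholders.

```c
#include <stdio.h>

/* Scalar sketch of the real-axis Mandelbrot iteration z_{k+1} = z_k^2 + c
 * with z_0 = 0: one multiply-add per step, i.e. 2 floating point operations. */
int main(void)
{
    const double c = -0.5;            /* placeholder: a real number inside the Mandelbrot set */
    const long   niter = 100000000L;  /* placeholder iteration count */
    double z = 0.0;

    for (long k = 0; k < niter; k++)
        z = z * z + c;                /* may compile to a single FMA instruction */

    printf("z = %f, flops = %ld\n", z, 2 * niter);  /* use z so the loop is not optimized away */
    return 0;
}
```

Each step is one fused multiply-add, i.e. two floating point operations; the vector kernels run many such independent chains per register, and on tensor cores one matrix iteration $`Z_{k+1}=Z_kZ_k + C`$ maps onto a single warp-level multiply-accumulate (the WMMA `mma_sync` primitive computes $`D = A \cdot B + C`$).
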
## Source
Use the following source files according to instruction set and precision:
* SSE (Nehalem) single precision: `mandelbrot-real-sse-mpi-dump.c`. Uses SSE 64 bit instructions.
* AVX (Sandybridge) double precision: `mandelbrot-real-avx-mpi-dump.c`.
* AVX2 FMA (Haswell) double precision: `mandelbrot-real-fma-mpi-mem4-dump.c`. 4 steps to cache for optimal pipelining
* AVX-512 FMA (Xeon Phi, Skylake) double precision: `mandelbrot-real-fma-mpi-dump-mic.c`. Works for MIC, SKX and KNL architectures
* PTX FMA (NVIDIA CUDA PTX) double precision: `mandelbrot-real-fma-ptx-dump.cu`. Uses PTX FMA instructions. Optimized for Nvidia K20 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices (see the launch-configuration sketch after this list).
* PTX WMMA (NVIDIA CUDA WMMA) half precision: `mandelbrot-real-wmma-ptx-f16-dump.cu`. Uses PTX WMMA (Warp Matrix-matrix Multiply Add) instructions, targeting NVIDIA V100 tensor cores. Optimized for Nvidia V100 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices.
* PTX WMMA (NVIDIA CUDA WMMA) double precision: `mandelbrot-real-wmma-ptx-f64-dump.cu`. Uses PTX WMMA (Warp Matrix-matrix Multiply Add) instructions, targeting NVIDIA A100 tensor cores. Optimized for Nvidia A100 peak performance. Set NBLOCKS and NTHREADS accordingly for other devices.
* CPUID: `cpuid.c`. Find out about CPU capabilities (a capability-check sketch also follows this list).
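
As an alternative to `cpuid.c` (illustration only, not the repository's code), the choice of CPU source file can be sketched with GCC/Clang's `__builtin_cpu_supports` on x86:

```c
#include <stdio.h>

/* Illustration only: map the host CPU's SIMD capabilities to the matching
 * Mandelbrot source file.  Requires GCC or Clang on an x86 CPU. */
int main(void)
{
    __builtin_cpu_init();

    if (__builtin_cpu_supports("avx512f"))
        puts("AVX-512 FMA: mandelbrot-real-fma-mpi-dump-mic.c");
    else if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        puts("AVX2 FMA:    mandelbrot-real-fma-mpi-mem4-dump.c");
    else if (__builtin_cpu_supports("avx"))
        puts("AVX:         mandelbrot-real-avx-mpi-dump.c");
    else if (__builtin_cpu_supports("sse"))
        puts("SSE:         mandelbrot-real-sse-mpi-dump.c");
    else
        puts("no SIMD extensions detected");
    return 0;
}
```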
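
The PTX kernels leave NBLOCKS and NTHREADS to the user. The sketch below shows one hedged way to pick starting values from the device properties; the heuristic (256 threads per block, a few blocks per SM) is an assumption to be tuned, not the repository's own rule.

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Hedged sketch (compile with nvcc): query the GPU and print plausible
 * starting values for the NBLOCKS/NTHREADS constants of the PTX kernels. */
int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int nthreads = 256;                           /* common, occupancy-friendly block size */
    int nblocks  = 4 * prop.multiProcessorCount;  /* a few resident blocks per SM */

    printf("%s: %d SMs, up to %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    printf("starting point: NBLOCKS=%d NTHREADS=%d\n", nblocks, nthreads);
    return 0;
}
```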
## Build
| Device | Instructions | Precision | Giga iterations/s | Gflop/s |
|---|---|---|---|---|
|Nvidia TITAN V, 1.46GHz | PTX FMA | double | 3434.9| 6869.8 |
|Nvidia V100-SXM3, 1.59GHz | PTX FMA | double | 4088.1| 8176.2 |
|Nvidia Tesla P100-PCIE-12GB, 1.33GHz | PTX FMA | single | 4737.2| 9474.4 |
|Nvidia A100-SXM4-40GB, 1.41GHz | PTX FMA | double | 4864.0| 9728 |
|Nvidia GTX 1080Ti, 1.58GHz | PTX FMA | single | 6000.0| 12000 |
|Nvidia Tesla T4, 1.59GHz | PTX FMA | half2 | 3200 | 12800 |
|Nvidia TITAN V, 1.46GHz | PTX FMA | single | 6870.2| 13740 |
|Nvidia V100-SXM3, 1.59GHz | PTX FMA | half | 8174.0| 16348 |
|Nvidia V100-SXM3, 1.59GHz | PTX FMA | single | 8175.0| 16350 |
|Nvidia Tesla P100-PCIE-12GB, 1.33GHz | PTX FMA | half2 | 4737.1| 18948 |
|Nvidia A100-SXM4-40GB, 1.41GHz | PTX FMA | single | 9731.0| 19462 |
|Nvidia A100-SXM4-40GB, 1.41GHz | PTX WMMA | double | 38.02 | 19466 |
|Nvidia Tesla T4, 1.59GHz | PTX WMMA | half | 3.20 | 26214 |
|Nvidia TITAN V, 1.46GHz | PTX FMA | half2 | 6863.8| 27455 |
|Nvidia TITAN V, 1.46GHz | PTX WMMA | half | 13.40 | 109772 |
|16x Nvidia V100-SXM3, 1.59GHz (full DGX-2)| PTX FMA | double | 65200 | 130400 |
|Nvidia V100-SXM3, 1.59GHz | PTX WMMA | half | 15.95 | 130662 |
|8x Nvidia A100-SXM4-40GB, 1.41GHz <br> (full Karolina supercomputer acn node) | PTX WMMA | double | 303.6 | 155443 |
|Nvidia A100-SXM4-40GB, 1.41GHz | PTX WMMA | half | 37.9 | 310477 |
|16x Nvidia V100-SXM3, 1.59GHz (full DGX-2)| PTX FMA | half2 | 128584| 514336 |
|16x Nvidia V100-SXM3, 1.59GHz (full DGX-2)| PTX WMMA | half | 254.3 | 2083226 |
|8x Nvidia A100-SXM4-40GB, 1.41GHz <br> (full Karolina supercomputer acn node) | PTX WMMA | half | 303.1 | 2482995 |
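
The last two columns are consistent with one fused multiply-add per scalar iteration (2 flops, or 4 flops per half2 iteration), $`2\cdot16\cdot16\cdot16 = 8192`$ flops per half precision WMMA tile iteration, and $`2\cdot8\cdot8\cdot4 = 512`$ flops per double precision WMMA tile iteration; for example, the V100 PTX WMMA half row gives $`15.95 \times 8192 \approx 130662`$ Gflop/s.
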
## License
Copyright (c) 2018, Branislav Jansik