diff --git a/docs.it4i/cs/guides/grace.md b/docs.it4i/cs/guides/grace.md
new file mode 100644
index 0000000000000000000000000000000000000000..d51729b31a5de109b932a8423d6959cfdea1414f
--- /dev/null
+++ b/docs.it4i/cs/guides/grace.md
@@ -0,0 +1,282 @@
+# Using NVIDIA Grace Partition
+
+For testing your application on the NVIDIA Grace partition,
+you need to prepare a job script for that partition or use the interactive job:
+
+```console
+salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00
+```
+
+where:
+
+- `-N 1` means allocation of a single node,
+- `-c 144` means allocation of 144 cores,
+- `-p p11-grace` is the NVIDIA Grace partition,
+- `--time=08:00:00` means allocation for 8 hours.
+
+## Available Toolchains
+
+The platform offers three toolchains:
+
+- Standard GCC (as a module `ml GCC`)
+- [NVHPC](https://developer.nvidia.com/hpc-sdk) (as a module `ml NVHPC`)
+- [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang) (installed in `/opt/nvidia/clang`)
+
+!!! note
+    The NVHPC toolchain showed strong results with a minimal amount of tuning necessary in our initial evaluation.
+
+### GCC Toolchain
+
+The GCC compiler seems to struggle with vectorization of short (constant-length) loops, which tend to get completely unrolled/eliminated instead of being vectorized. For example, a simple nested loop such as
+
+```cpp
+for(int i = 0; i < 1000000; ++i) {
+    // Iterations dependent in "i"
+    // ...
+    for(int j = 0; j < 8; ++j) {
+        // but independent in "j"
+        // ...
+    }
+}
+```
+
+may emit scalar code for the inner loop, leading to no vectorization being used at all.
+
+### Clang (for Grace) Toolchain
+
+Clang/LLVM tends to behave similarly, but it can be guided to properly vectorize the inner loop either with the flags `-O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8`, or with pragmas such as `#pragma clang loop vectorize_width(8)` and `#pragma clang loop unroll(disable)`:
+
+```cpp
+for(int i = 0; i < 1000000; ++i) {
+    // Iterations dependent in "i"
+    // ...
+    #pragma clang loop unroll(disable) vectorize_width(8)
+    for(int j = 0; j < 8; ++j) {
+        // but independent in "j"
+        // ...
+    }
+}
+```
+
+!!! note
+    Our basic experiments show that fixed-width vectorization (NEON) tends to perform better than SVE in the case of short (register-length) loops. In cases like the one above, where the specified `vectorize_width` is larger than the available vector unit width, Clang will emit multiple NEON instructions (e.g., 4 instructions will be emitted to process 8 64-bit operations in the 128-bit units of Grace).
+
+### NVHPC Toolchain
+
+The NVHPC toolchain handled the aforementioned case without any additional tuning. A simple `-O3 -march=native -fast` should therefore be sufficient.
+
+## Basic Math Libraries
+
+The basic math libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with `-lblas` and `-llapack` for BLAS and LAPACK, respectively (`lp64` and `ilp64` versions are also included).
+
+!!! note
+    The Grace platform doesn't include a CUDA-capable GPU, therefore `nvcc` will fail with an error. This means that `nvc`, `nvc++`, and `nvfortran` should be used instead.
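+
+As a quick sanity check of the BLAS linking described above, a minimal sketch such as the following can be used. It assumes the `cblas.h` header shipped with NVHPC (the same header and include path used by the larger example later in this guide); the file name `sgemv_check.cpp` is purely illustrative:
+
+```cpp
+#include <cstdio>
+#include "cblas.h"
+
+int main()
+{
+    // y = A*x for a small 3x3 row-major matrix, computed by the BLAS 2 routine SGEMV
+    const float A[9] = {1.0f, 2.0f, 3.0f,
+                        4.0f, 5.0f, 6.0f,
+                        7.0f, 8.0f, 9.0f};
+    const float x[3] = {1.0f, 1.0f, 1.0f};
+    float y[3]       = {0.0f, 0.0f, 0.0f};
+
+    cblas_sgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0f, A, 3, x, 1, 0.0f, y, 1);
+
+    std::printf("%f %f %f\n", y[0], y[1], y[2]); // expected: 6 15 24
+    return 0;
+}
+```
+
+It should build with `nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 sgemv_check.cpp -lblas -o sgemv_check`, i.e. with the same include path and link flag as the NVHPC example shown later in this guide.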
+
+### NVIDIA Performance Libraries
+
+The [NVPL](https://developer.nvidia.com/nvpl) package includes a more extensive set of libraries in both sequential and multi-threaded versions:
+
+- BLACS: `-lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}`
+- BLAS: `-lnvpl_blas_{lp64,ilp64}_{seq,gomp}`
+- FFTW: `-lnvpl_fftw`
+- LAPACK: `-lnvpl_lapack_{lp64,ilp64}_{seq,gomp}`
+- ScaLAPACK: `-lnvpl_scalapack_{lp64,ilp64}`
+- RAND: `-lnvpl_rand` or `-lnvpl_rand_mt`
+- SPARSE: `-lnvpl_sparse`
+
+This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see also [NVPL](https://docs.nvidia.com/nvpl).
+
+## Basic Communication Libraries
+
+The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (`ml OpenMPI`). The following example
+
+```cpp
+#include <stdio.h>
+#include <mpi.h>
+#include <sched.h>
+#include <omp.h>
+
+int main(int argc, char **argv)
+{
+    int rank;
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    #pragma omp parallel
+    {
+        printf("Hello on rank %d, thread %d on CPU %d\n", rank, omp_get_thread_num(), sched_getcpu());
+    }
+    MPI_Finalize();
+}
+```
+
+can be compiled and run as follows:
+
+```console
+ml OpenMPI
+mpic++ -fast -fopenmp hello.cpp -o hello
+OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello
+```
+
+In this configuration, we run 4 ranks, each bound to a quarter of the node's 144 cores (`pe=36`) and each running 4 OpenMP threads.
+
+## Simple BLAS Application
+
+The `hello world` example application (written in `C++` and `Fortran`) uses simple stationary probability vector estimation to illustrate the use of GEMM (BLAS 3 routine).
+
+Stationary probability vector estimation in `C++`:
+
+```cpp
+#include <iostream>
+#include <vector>
+#include <algorithm>
+#include <chrono>
+#include "cblas.h"
+
+const size_t ITERATIONS = 32;
+const size_t MATRIX_SIZE = 1024;
+
+int main(int argc, char *argv[])
+{
+    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;
+
+    // Matrix filled with 1/N, with a modified first row (row-major storage)
+    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));
+
+    for(size_t i = 0; i < MATRIX_SIZE; ++i)
+        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
+    a[0] = 0.5f;
+
+    std::vector<float> w1(matrixElements, 0.0f);
+    std::vector<float> w2(matrixElements, 0.0f);
+
+    std::copy(a.begin(), a.end(), w1.begin());
+
+    // t1 points to the current iterate, t2 to the next one; they are swapped after each GEMM
+    std::vector<float> *t1, *t2;
+    t1 = &w1;
+    t2 = &w2;
+
+    auto c1 = std::chrono::steady_clock::now();
+
+    for(size_t i = 0; i < ITERATIONS; ++i)
+    {
+        std::fill(t2->begin(), t2->end(), 0.0f);
+
+        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
+                    1.0f, t1->data(), MATRIX_SIZE,
+                    a.data(), MATRIX_SIZE,
+                    1.0f, t2->data(), MATRIX_SIZE);
+
+        std::swap(t1, t2);
+    }
+
+    auto c2 = std::chrono::steady_clock::now();
+
+    for(size_t i = 0; i < MATRIX_SIZE; ++i)
+    {
+        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
+    }
+
+    std::cout << std::endl;
+
+    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
+
+    return 0;
+}
+```
+
+Stationary probability vector estimation in `Fortran`:
+
+```fortran
+program main
+    implicit none
+
+    integer :: matrix_size, iterations
+    integer :: i
+    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
+    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
+    real, pointer :: out_data(:), out_diag(:)
+    integer :: cr, cm, c1, c2
+
+    iterations = 32
+    matrix_size = 1024
+
+    call system_clock(count_rate=cr)
+    call system_clock(count_max=cm)
+
+    allocate(a(matrix_size, matrix_size))
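+    ! a is the transition matrix; w1 holds the current iterate and w2 the next one.
+    ! The t1/t2 pointers below are swapped after each GEMM call instead of copying data.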
+    allocate(w1(matrix_size, matrix_size))
+    allocate(w2(matrix_size, matrix_size))
+
+    a(:,:) = 1.0 / real(matrix_size)
+    a(:,1) = 0.5 / real(matrix_size - 1)
+    a(1,1) = 0.5
+
+    w1 = a
+    w2(:,:) = 0.0
+
+    t1 => w1
+    t2 => w2
+
+    call system_clock(c1)
+
+    do i = 1, iterations
+        t2(:,:) = 0.0
+
+        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)
+
+        tmp => t1
+        t1 => t2
+        t2 => tmp
+    end do
+
+    call system_clock(c2)
+
+    out_data(1:size(t1)) => t1
+    out_diag => out_data(1::matrix_size+1)
+
+    print *, out_diag
+    print *, "Elapsed Time: ", (c2 - c1) / real(cr)
+
+    deallocate(a)
+    deallocate(w1)
+    deallocate(w2)
+end program main
+```
+
+### Using NVHPC Toolchain
+
+The C++ version of the example can be compiled with NVHPC and run as follows:
+
+```console
+ml NVHPC
+nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lblas main.cpp -o main
+OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main
+```
+
+The Fortran version is just as simple:
+
+```console
+ml NVHPC
+nvfortran -O3 -march=native -fast -lblas main.f90 -o main.x
+OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main.x
+```
+
+!!! note
+    It may be advantageous to use the NVPL libraries instead of the NVHPC ones. For example, the DGEMM BLAS 3 routine from NVPL is almost 30% faster than the NVHPC one.
+
+### Using Clang (for Grace) Toolchain
+
+Similarly, the Clang for Grace toolchain with NVPL BLAS can be used to compile the C++ version of the example:
+
+```console
+ml NVHPC
+/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native -ffast-math -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main
+```
+
+!!! note
+    The NVHPC module is used just for the `cblas.h` include in this case. This can be avoided by changing the code to use `nvpl_blas.h` instead, as sketched at the end of this guide.
+
+## Additional Resources
+
+- [https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/][1]
+- [https://developer.nvidia.com/hpc-sdk][2]
+- [https://developer.nvidia.com/grace/clang][3]
+- [https://docs.nvidia.com/nvpl][4]
+
+[1]: https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/
+[2]: https://developer.nvidia.com/hpc-sdk
+[3]: https://developer.nvidia.com/grace/clang
+[4]: https://docs.nvidia.com/nvpl
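+
+As noted in the Clang section above, the NVHPC dependency can be dropped entirely by switching to the NVPL header. A minimal sketch of that variant follows; it assumes that `nvpl_blas.h` exposes the same CBLAS-style interface used throughout this guide (as the note above implies), and the small matrices and file name are purely illustrative:
+
+```cpp
+#include <cstdio>
+#include <nvpl_blas.h>   // replaces "cblas.h" from the NVHPC toolchain
+
+int main()
+{
+    // C = A*B for 2x2 row-major matrices, computed by the BLAS 3 routine SGEMM
+    const float A[4] = {1.0f, 2.0f,
+                        3.0f, 4.0f};
+    const float B[4] = {5.0f, 6.0f,
+                        7.0f, 8.0f};
+    float C[4]       = {0.0f, 0.0f, 0.0f, 0.0f};
+
+    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 2, 2, 2,
+                1.0f, A, 2, B, 2, 0.0f, C, 2);
+
+    std::printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]); // expected: 19 22 / 43 50
+    return 0;
+}
+```
+
+With this change, the code should build against NVPL alone, e.g. `/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native nvpl_sgemm_check.cpp -lnvpl_blas_lp64_gomp -o nvpl_sgemm_check`, using the link flag listed in the NVPL section above and no NVHPC include path.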