# Using NVIDIA Grace Partition
 
 
For testing your application on the NVIDIA Grace partition, you need to prepare a job script for that partition or use the interactive job:
 
 
```console
salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00
```
 
 
where:
 
 
- `-N 1` means allocation of a single node,
- `-c 144` means allocation of 144 cores,
- `-p p11-grace` is the NVIDIA Grace partition,
- `--time=08:00:00` means allocation for 8 hours.
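
For non-interactive work, the same options can be passed to `sbatch` together with a job script. A minimal sketch (`my_job.sh` is a placeholder for your own job script):

```console
sbatch -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00 ./my_job.sh
```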
 
 
## Available Toolchains
 
 
The platform offers three toolchains:
 
- Standard GCC (as a module `ml GCC`)
- [NVHPC](https://developer.nvidia.com/hpc-sdk) (as a module `ml NVHPC`)
- [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang) (installed in `/opt/nvidia/clang`)
 
 
!!! note
    The NVHPC toolchain showed strong results with minimal tuning in our initial evaluation.
 
 
### GCC Toolchain
 
 
The GCC compiler seems to struggle with vectorization of short (constant-length) loops, which tend to be completely unrolled/eliminated instead of vectorized. For example, a simple nested loop such as
 
 
```cpp
for(int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    for(int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}
```
 
may emit scalar code for the inner loop, leading to no vectorization being used at all.
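
Whether GCC actually vectorized a given loop can be checked from its optimization report. A small sketch using standard GCC flags (`main.cpp` is a placeholder source file):

```console
ml GCC
g++ -O3 -mcpu=native -fopt-info-vec-missed -c main.cpp
```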
 
 
### Clang (for Grace) Toolchain
 
 
Clang/LLVM tends to behave similarly, but it can be guided to properly vectorize the inner loop either with the flags `-O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8` or with pragmas such as `#pragma clang loop vectorize_width(8)` and `#pragma clang loop unroll(disable)`.
 
 
```cpp
for(int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    #pragma clang loop unroll(disable) vectorize_width(8)
    for(int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}
```
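
Alternatively, the same effect can be achieved without pragmas by passing the flags listed above on the command line. A sketch, assuming the Clang for Grace install path used later in this guide and a placeholder `main.cpp`:

```console
/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8 -c main.cpp
```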
 
 
!!! note
    Our basic experiments show that fixed-width vectorization (NEON) tends to perform better than SVE in the case of short (register-length) loops. In cases (like the one above) where the specified `vectorize_width` is larger than the available vector unit width, Clang will emit multiple NEON instructions (e.g., 4 instructions will be emitted to process 8 64-bit operations in the 128-bit units of Grace).
 
 
### NVHPC Toolchain
 
 
The NVHPC toolchain handled the aforementioned case without any additional tuning. Simple `-O3 -march=native -fast` should therefore be sufficient.
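
To verify what the NVHPC compiler did with a particular loop, its optimization report can be requested with `-Minfo`. A small sketch (`main.cpp` is a placeholder source file):

```console
ml NVHPC
nvc++ -O3 -march=native -fast -Minfo=vect -c main.cpp
```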
 
 
## Basic Math Libraries
 
 
The basic libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with `-lblas` and `-llapack` for BLAS and LAPACK respectively (`lp64` and `ilp64` versions are also included).
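
For example, a Fortran program calling BLAS and LAPACK routines could be built roughly as follows (a sketch; `prog.f90` is a placeholder source file):

```console
ml NVHPC
nvfortran -O3 -fast prog.f90 -llapack -lblas -o prog
```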
 
 
!!! note
    The Grace platform doesn't include a CUDA-capable GPU, therefore `nvcc` will fail with an error. This means that `nvc`, `nvc++`, and `nvfortran` should be used instead.
 
 
### NVIDIA Performance Libraries
 
 
The [NVPL](https://developer.nvidia.com/nvpl) package includes a more extensive set of libraries in both sequential and multi-threaded versions:
 
- BLACS: `-lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}`
- BLAS: `-lnvpl_blas_{lp64,ilp64}_{seq,gomp}`
- FFTW: `-lnvpl_fftw`
- LAPACK: `-lnvpl_lapack_{lp64,ilp64}_{seq,gomp}`
- ScaLAPACK: `-lnvpl_scalapack_{lp64,ilp64}`
- RAND: `-lnvpl_rand` or `-lnvpl_rand_mt`
- SPARSE: `-lnvpl_sparse`
 
 
This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see also [NVPL](https://docs.nvidia.com/nvpl).
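
For example, a C++ program calling CBLAS routines could be linked against the multi-threaded LP64 BLAS from NVPL roughly as follows (a sketch, assuming the NVPL headers and libraries are already on the compiler's search paths; `main.cpp` is a placeholder):

```console
ml GCC
g++ -O3 -fopenmp main.cpp -lnvpl_blas_lp64_gomp -o main
```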
 
 
## Basic Communication Libraries
 
 
The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (`ml OpenMPI`). The following example
 
 
```cpp
#include <stdio.h>
#include <mpi.h>
#include <sched.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("Hello on rank %d, thread %d on CPU %d\n", rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```
 
 
can be compiled and run as follows
 
 
```console
ml OpenMPI
mpic++ -fast -fopenmp hello.cpp -o hello
OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello
```
 
In this configuration, we run 4 ranks, each bound to one quarter of the node's cores (36 cores per rank), with 4 OpenMP threads per rank.
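
Other hybrid layouts follow the same scheme; for example, 2 ranks with 72 cores and 72 OpenMP threads each could be launched as follows (a sketch mirroring the mapping used above):

```console
OMP_PROC_BIND=close OMP_NUM_THREADS=72 mpirun -np 2 --map-by slot:pe=72 ./hello
```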
 
 
## Simple BLAS Application
 
 
The `hello world` example application (written in `C++` and `Fortran`) uses simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS 3 routine).
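
Conceptually, the example builds a row-stochastic matrix and repeatedly multiplies a work matrix by it, i.e., a simple power iteration (a brief sketch of the recurrence; the notation below is ours, not part of the code):

$$ W_{k+1} = W_k A, \qquad W_0 = A, $$

so the rows of $W_k$ approach the stationary probability vector $\pi$ satisfying $\pi = \pi A$; the programs print the diagonal of the last iterate as an estimate of $\pi$.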
 
 
Stationary probability vector estimation in `C++`:
 
 
```cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include "cblas.h"

const size_t ITERATIONS = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;

    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);

    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }

    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
```
 
 
Stationary probability vector estimation in `Fortran`:
 
 
```fortran
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    do i = 0, iterations
        t2(:,:) = 0.0

        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)

        tmp => t1
        t1 => t2
        t2 => tmp
    end do

    call system_clock(c2)

    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
```
 
 
### Using NVHPC Toolchain
 
 
The C++ version of the example can be compiled with NVHPC and run as follows:
 
 
```console
ml NVHPC
nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lblas main.cpp -o main
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main
```
 
 
The Fortran version is just as simple:
 
 
```console
ml NVHPC
nvfortran -O3 -march=native -fast -lblas main.f90 -o main.x
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main.x
```
 
 
!!! note
    It may be advantageous to use the NVPL libraries instead of the NVHPC ones. For example, the DGEMM BLAS 3 routine from NVPL is almost 30% faster than the NVHPC one.
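
Switching the example to NVPL BLAS amounts to swapping the linked library. A hedged sketch combining the NVHPC compile line above with the NVPL library name from the list earlier in this guide (header search path kept from the NVHPC build):

```console
ml NVHPC
nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main
```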
 
 
### Using Clang (for Grace) Toolchain
 
 
Similarly, the Clang for Grace toolchain with NVPL BLAS can be used to compile the C++ version of the example.
 
 
```console
ml NVHPC
/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native -ffast-math -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main
```
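
The resulting binary can then be run in the same way as the NVHPC build, e.g.:

```console
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main
```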
 
 
!!! note
    The NVHPC module is used only for the `cblas.h` include in this case. This can be avoided by changing the code to use `nvpl_blas.h` instead.
 
 
## Additional Resources
 
 
- [https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/][1]
- [https://developer.nvidia.com/hpc-sdk][2]
- [https://developer.nvidia.com/grace/clang][3]
- [https://docs.nvidia.com/nvpl][4]
 
 
[1]: https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/
[2]: https://developer.nvidia.com/hpc-sdk
[3]: https://developer.nvidia.com/grace/clang
[4]: https://docs.nvidia.com/nvpl
 