For testing your application on the NVIDIA Grace partition,
you need to prepare a job script for that partition or run an interactive job:
```console
salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 144` allocates 144 cores,
- `-p p11-grace` selects the NVIDIA Grace partition,
- `--time=08:00:00` allocates the resources for 8 hours.
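A minimal job script might look as follows (the job name and application binary are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=grace-test
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p11-grace
#SBATCH --nodes=1
#SBATCH --cpus-per-task=144
#SBATCH --time=08:00:00

ml NVHPC
./my-application
```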
## Available Toolchains
The platform offers three toolchains:
- Standard GCC (as a module `ml GCC`)
- [NVHPC](https://developer.nvidia.com/hpc-sdk) (as a module `ml NVHPC`)
- [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang) (installed in `/opt/nvidia/clang`)
!!! note
    The NVHPC toolchain showed strong results with minimal tuning necessary in our initial evaluation.
### GCC Toolchain
The GCC compiler seems to struggle with the vectorization of short (constant-length) loops, which tend to be completely unrolled/eliminated instead of vectorized. For example, a simple nested loop such as
```cpp
for (int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    for (int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}
```
may emit scalar code for the inner loop, leading to no vectorization being used at all.
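Whether GCC vectorized a given loop can be checked via its optimization report (the source file name is a placeholder):

```console
g++ -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c loop.cpp
```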
### Clang (for Grace) Toolchain
Clang/LLVM tends to behave similarly, but can be guided to properly vectorize the inner loop, either with the flags `-O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8`, or with pragmas such as `#pragma clang loop vectorize_width(8)` and `#pragma clang loop unroll(disable)`.
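Applied to the loop above, the pragma-based variant might look like this (a minimal sketch):

```cpp
for (int i = 0; i < 1000000; ++i) {
    // ...
    #pragma clang loop vectorize_width(8)
    #pragma clang loop unroll(disable)
    for (int j = 0; j < 8; ++j) {
        // ...
    }
}
```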
Our basic experiments show that fixed-width vectorization (NEON) tends to perform better for short (register-length) loops than SVE. In cases like the one above, where the specified `vectorize_width` is larger than the available vector unit width, Clang will emit multiple NEON instructions (e.g., 4 instructions will be emitted to process 8 64-bit operations in the 128-bit units of Grace).
### NVHPC Toolchain
The NVHPC toolchain handled the aforementioned case without any additional tuning; simple `-O3 -march=native -fast` should therefore be sufficient.
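For example (the source file name is a placeholder):

```console
nvc++ -O3 -march=native -fast loop.cpp -o loop
```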
## Basic Math Libraries
The basic libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with `-lblas` and `-llapack` for BLAS and LAPACK, respectively (`lp64` and `ilp64` versions are also included).
!!! note
    The Grace platform doesn't include a CUDA-capable GPU, therefore `nvcc` will fail with an error. This means that `nvc`, `nvc++`, and `nvfortran` should be used instead.
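A plausible compile-and-link line with `nvc++` (the source file name is a placeholder):

```console
nvc++ -O3 myapp.cpp -o myapp -llapack -lblas
```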
### NVIDIA Performance Libraries
The [NVPL](https://developer.nvidia.com/nvpl) package includes a more extensive set of libraries in both sequential and multi-threaded versions, covering BLAS, FFT, LAPACK, RAND, ScaLAPACK, Sparse, and Tensor routines.
This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see [NVPL](https://docs.nvidia.com/nvpl).
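As an illustration, linking the sequential `lp64` variant of NVPL BLAS might look as follows; the exact library names should be verified against the NVPL documentation:

```console
g++ -O3 myapp.cpp -o myapp -lnvpl_blas_lp64_seq
```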
## Basic Communication Libraries
The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (`ml OpenMPI`). The following example prints the placement of each MPI rank and its OpenMP threads:
```cpp
#include <mpi.h>
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("Hello on rank %d, thread %d on CPU %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```
In this configuration, we run 4 ranks, each bound to one quarter of the node's cores, with 4 OpenMP threads per rank.
## Simple BLAS Application
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS 3 routine).
Stationary probability vector estimation in `C++`:
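A minimal sketch of the idea follows (the 4-state transition matrix and the Fortran `dgemm_` binding are illustrative assumptions; link with `-lblas`). Repeated squaring of the transition matrix makes every column converge to the stationary vector:

```cpp
#include <cstdio>
#include <vector>

// Reference (Fortran) BLAS GEMM, column-major; link with e.g. -lblas.
extern "C" void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

int main()
{
    const int n = 4;
    // Hypothetical column-stochastic transition matrix (each column sums to 1),
    // stored column-major: P(i,j) = probability of moving from state j to state i.
    std::vector<double> P = {
        0.9, 0.1, 0.0, 0.0,   // column 0
        0.2, 0.6, 0.2, 0.0,   // column 1
        0.0, 0.3, 0.5, 0.2,   // column 2
        0.0, 0.0, 0.4, 0.6    // column 3
    };
    std::vector<double> Q(n * n);

    const double one = 1.0, zero = 0.0;
    for (int it = 0; it < 32; ++it) {
        // Q = P * P via GEMM (BLAS 3); repeated squaring computes P^(2^32).
        dgemm_("N", "N", &n, &n, &n, &one, P.data(), &n,
               P.data(), &n, &zero, Q.data(), &n);
        P.swap(Q);
    }

    // After convergence, every column of P approximates the stationary vector.
    for (int i = 0; i < n; ++i)
        std::printf("pi[%d] = %f\n", i, P[i]);
    return 0;
}
```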