diff --git a/docs.it4i/cs/guides/grace.md b/docs.it4i/cs/guides/grace.md
index a2fba21d7a2331fa10ef44a2c73274b32e542446..3266a505dcec5bab8483602ad3ce76d6f7e9fce7 100644
--- a/docs.it4i/cs/guides/grace.md
+++ b/docs.it4i/cs/guides/grace.md
@@ -17,6 +17,7 @@ where:
 ## Available Toolchains
 
 The platform offers three toolchains:
+
 - Standard GCC (as a module `ml GCC`)
 - [NVHPC](https://developer.nvidia.com/hpc-sdk) (as a module `ml NVHPC`)
 - [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang) (installed in `/opt/nvidia/clang`)
@@ -38,6 +39,7 @@ for(int i = 0; i < 1000000; ++i) {
     }
 }
 ```
+
 may emit scalar code for the inner loop leading to no vectorization being used at all.
 
 ### Clang (For Grace) Toolchain
@@ -73,6 +75,7 @@ The basic libraries (BLAS and LAPACK) are included in NVHPC toolchain and can be
 ### NVIDIA Performance Libraries
 
 The [NVPL](https://developer.nvidia.com/nvpl) package includes more extensive set of libraries in both sequential and multi-threaded versions:
+
 - BLACS: `-lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}`
 - BLAS: `-lnvpl_blas_{lp64,ilp64}_{seq,gomp}`
 - FFTW: `-lnvpl_fftw`
@@ -112,6 +115,7 @@ ml OpenMPI
 mpic++ -fast -fopenmp hello.cpp -o hello
 OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello
 ```
+
 In this configuration we run 4 ranks bound to one quarter of cores each with 4 OpenMP threads.
 
 ## Simple BLAS Application