From f3ceec85740926da222c78ef680d046c73e98646 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Filip=20Stan=C4=9Bk?= <filip.stanek@vsb.cz>
Date: Wed, 13 Nov 2024 08:02:42 +0100
Subject: [PATCH] Adjust optimal compilation of codes on Karolina with regard to the availability of a functional...
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs.it4i/software/karolina-compilation.md | 58 ++++++++++++++++------
 1 file changed, 43 insertions(+), 15 deletions(-)
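
The thread counts in the run examples follow from the node layout described in the text: 128 cores per Karolina compute node, split evenly among the MPI ranks placed on that node. A minimal shell sketch of that arithmetic, assuming one rank per socket; the variable names are illustrative and not part of the documented environment:

```shell
# Illustrative only: these variable names are assumptions, not Karolina settings.
cores_per_node=128   # cores on one Karolina compute node (from the text)
ranks_per_node=2     # MPI ranks placed on each node, e.g. one per socket
threads_per_rank=$(( cores_per_node / ranks_per_node ))
echo "OMP_NUM_THREADS=${threads_per_rank}"
```

With one rank per node instead, the same division gives the 128 threads used in the two-node example.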

diff --git a/docs.it4i/software/karolina-compilation.md b/docs.it4i/software/karolina-compilation.md
index fc278b8db..a0e222347 100644
--- a/docs.it4i/software/karolina-compilation.md
+++ b/docs.it4i/software/karolina-compilation.md
@@ -24,29 +24,57 @@ see [Lorenz Compiler performance benchmark][a].
 ## 2. Use BLAS Library
 
 It is important to use the BLAS library that performs well on AMD processors.
-We have measured the best performance with the MKL;
-however, the MKL BLAS must be ‘tricked’ to believe it is working with an Intel CPU.
+To combine optimizations for general CPU code with the most efficient BLAS routines, we recommend the latest Intel Compiler suite together with Cray's Scientific Library bundle (LIBSCI). The Intel Compiler suite also supports an efficient MPI implementation, using the Intel MPI library over the InfiniBand interconnect.
 
+For both compiling and running the compiled code, use:
+
+```code
+ml PrgEnv-intel
+ml cray-pmi/6.1.14
+
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CRAY_LD_LIBRARY_PATH:$CRAY_LIBSCI_PREFIX_DIR/lib:/opt/cray/pals/1.3.2/lib
+```
+There are two common ways to compile and run the code:
+
+### OpenMP without MPI
+
+To compile the code against LIBSCI without MPI, while still using OpenMP across multiple cores, use:
 ```code
-ml mkl
-ml KAROLINA/FAKEintel
+icx -qopenmp -L$CRAY_LIBSCI_PREFIX_DIR/lib -I$CRAY_LIBSCI_PREFIX_DIR/include -o BINARY.x SOURCE_CODE.c -lsci_intel_mp
 ```
 
-Further, it is very important to pin the BLAS threads to the cores
-and also to restrict BLAS threads to run on a single socket of an AMD processor.
+To run the resulting binary use:
+```code
+OMP_NUM_THREADS=128 OMP_PROC_BIND=true ./BINARY.x
+```
+This runs the binary efficiently across all 128 cores available on a single Karolina compute node.
 
+### OpenMP with MPI
+To compile the code against LIBSCI with MPI, use:
 ```code
-OMP_NUM_THREADS = 64
-GOMP_CPU_AFFINITY=0:63:1
+mpiicx -qopenmp -L$CRAY_LIBSCI_PREFIX_DIR/lib -I$CRAY_LIBSCI_PREFIX_DIR/include -o BINARY.x SOURCE_CODE.c -lsci_intel_mp -lsci_intel_mpi_mp
 ```
 
-However, to get full performance, you have to execute two jobs on two Karolina sockets at the time.
-Other BLAS libraries may be used, however none performs as well as the MKL.
+To run the resulting binary use:
+```code
+OMP_NUM_THREADS=64 OMP_PROC_BIND=true mpirun -n 2 ${HOME}/BINARY.x
+```
+This example runs BINARY.x, located in ${HOME}, as 2 MPI processes, each using the 64 cores of one socket of a single node.
+
+Another example is a job on 2 full nodes, using 128 cores on each (256 cores in total) and letting LIBSCI efficiently place the BLAS routines across the allocated CPU sockets:
+```code
+OMP_NUM_THREADS=128 OMP_PROC_BIND=true mpirun -n 2 ${HOME}/BINARY.x
+```
+This assumes you have allocated 2 full nodes on Karolina using SLURM directives, e.g. in a submission script:
+```code
+#SBATCH --nodes 2
+#SBATCH --ntasks-per-node 128
+```
 
-!!! note
-    Most MPI libraries do the binding automatically. The binding of MPI ranks can be inspected for any MPI by running `$ mpirun -n num_of_ranks numactl --show`. However, if the ranks spawn threads, binding of these threads should be done via the environment variables described above.
+**Don't forget** to make sure, before the run, that the correct modules are loaded and that the LD_LIBRARY_PATH environment variable is set as shown above (e.g. as part of your SLURM submission script).
 
-The choice of BLAS library and its performance may be verified with our benchmark,
-see [Lorenz BLAS performance benchmark][a].
+!!! note
+    Most MPI libraries do the binding automatically. The binding of MPI ranks can be inspected for any MPI by running `$ mpirun -n num_of_ranks numactl --show`. However, if the ranks spawn threads, binding of these threads should be done via the environment variables described above.
 
-[a]: https://code.it4i.cz/jansik/lorenz/-/blob/main/README.md
+The choice of BLAS library and its performance may be verified with our benchmark,
+see [Lorenz BLAS performance benchmark](https://code.it4i.cz/jansik/lorenz/-/blob/main/README.md).
-- 
GitLab