diff --git a/docs.it4i/lumi/pytorch.md b/docs.it4i/lumi/pytorch.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa3e54afde555a7d886a2890ac7e29cbf5a87def
--- /dev/null
+++ b/docs.it4i/lumi/pytorch.md
@@ -0,0 +1,194 @@
+# PyTorch
+
+## PyTorch Highlights
+
+* Official page: [https://pytorch.org/][1]
+* Code: [https://github.com/pytorch/pytorch][2]
+* Python-based framework for machine learning
+    * automatic differentiation on tensor types
+* Official LUMI page: [https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/][3]
+    * **Warning:** Be careful where the SIF image is installed or copied (`$HOME` is not recommended for quota reasons). For EasyBuild, you must specify the installation path: `eb PyTorch.eb -r --prefix=$PWD/easybuild`.
+
+## PyTorch Installation
+
+### Base Environment
+
+```console
+module purge
+module load CrayEnv
+module load PrgEnv-cray/8.3.3
+module load craype-accel-amd-gfx90a
+module load cray-python
+
+# Default ROCm – more recent versions (e.g. ROCm 5.6.0) are preferable
+module load rocm/5.2.3
+```
+
+### Scripts
+
+* natively
+    * [01-install-direct-torch1.13.1-rocm5.2.3.sh](scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
+    * [01-install-direct-torch2.1.2-rocm5.5.3.sh](scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
+* virtual env
+    * [02-install-venv-torch1.13.1-rocm5.2.3.sh](scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
+    * [02-install-venv-torch2.1.2-rocm5.5.3.sh](scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
+* conda env
+    * [03-install-conda-torch1.13.1-rocm5.2.3.sh](scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
+    * [03-install-conda-torch2.1.2-rocm5.5.3.sh](scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
+* from source
+    * [04-install-source-torch1.13.1-rocm5.2.3.sh](scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
+* containers (Singularity)
+    * [05-install-container-torch2.0.1-rocm5.5.1.sh](scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
+    * [05-install-container-torch2.1.0-rocm5.6.1.sh](scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
+
+## PyTorch Tests
+
+### Run Interactive Job on a Single Node
+
+```console
+salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
+```
+
+### Scripts
+
+* natively
+    * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
+    * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
+* virtual env
+    * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
+    * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
+* conda env
+    * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
+    * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
+* from source
+    * [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
+* containers (Singularity)
+    * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
+    * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
+
+### Run Interactive Job on Multiple Nodes
+
+```console
+salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
+```
+
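+Once the multi-node allocation is granted, it is worth sanity-checking multi-GPU communication before launching the full test scripts below. The following is a minimal sketch, not one of the official scripts: the file name `allreduce_test.py` is hypothetical, and it assumes a working ROCm build of PyTorch (where the `nccl` backend is provided by RCCL) and `MASTER_ADDR`/`MASTER_PORT` exported as shown in the comments.
+
+```python
+# allreduce_test.py -- hypothetical sanity check for a 2-node job step.
+# Before launching, export the rendezvous address, e.g.:
+#   export MASTER_ADDR=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
+#   export MASTER_PORT=29500
+# then run: srun python3 allreduce_test.py
+import os
+
+import torch
+import torch.distributed as dist
+
+# Slurm exports these for every task in the job step
+rank = int(os.environ["SLURM_PROCID"])
+world_size = int(os.environ["SLURM_NTASKS"])
+local_rank = int(os.environ["SLURM_LOCALID"])
+
+# On ROCm builds of PyTorch, the "nccl" backend maps to RCCL
+dist.init_process_group("nccl", rank=rank, world_size=world_size)
+torch.cuda.set_device(local_rank)
+
+# Every rank contributes its rank id; after the sum all-reduce each rank
+# should hold world_size * (world_size - 1) / 2
+t = torch.tensor([float(rank)], device="cuda")
+dist.all_reduce(t, op=dist.ReduceOp.SUM)
+print(f"rank {rank}/{world_size}: all-reduce result = {t.item()}")
+
+dist.destroy_process_group()
+```
+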
+### Scripts
+
+* containers (Singularity)
+    * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
+    * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
+    * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
+    * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
+
+## Tips
+
+### Official Containers
+
+```console
+ls -la /appl/local/containers/easybuild-sif-images/
+```
+
+### Unofficial Versions of ROCm
+
+```console
+module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
+ml rocm/5.4.3
+# or
+ml rocm/5.6.0
+```
+
+### Unofficial Containers
+
+```console
+ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
+```
+
+### Installing Python Modules in Containers
+
+```console
+#!/bin/bash
+
+wd=$(pwd)
+SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
+
+# Generate a setup script to be executed inside the container;
+# $WITH_CONDA activates the conda environment shipped with the image.
+rm -f $wd/setup-me.sh
+cat > $wd/setup-me.sh << EOF
+#!/bin/bash -e
+
+\$WITH_CONDA
+pip3 install scipy h5py tqdm
+EOF
+chmod +x $wd/setup-me.sh
+
+# Packages land in pip_install, bind-mounted over ~/.local/lib
+mkdir -p $wd/pip_install
+
+srun -n 1 --gpus 8 singularity exec \
+    -B /var/spool/slurmd:/var/spool/slurmd \
+    -B /opt/cray:/opt/cray \
+    -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
+    -B $wd:/workdir \
+    -B $wd/pip_install:$HOME/.local/lib \
+    $SIF /workdir/setup-me.sh
+
+# Add the pip_install bind to the singularity exec call in run.sh as well:
+# -B $wd/pip_install:$HOME/.local/lib \
+```
+
+### Controlling Device Visibility
+
+* `HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
+* `ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
+* Slurm sets `ROCR_VISIBLE_DEVICES` itself
+* The two variables differ in how inter-GPU copies are implemented (blit kernels and/or direct DMA), so the choice has performance implications
+
+### RCCL
+
+* The problem – on startup, you may see:
+    * `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
+* Checking the error origin:
+    * `export NCCL_DEBUG=INFO`
+    * `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0> [3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
+    * `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
+    * RCCL is picking up the management network (`nmn0`) alongside the Slingshot high-speed interfaces
+* The fix – restrict RCCL to the high-speed interfaces:
+    * `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
+
+### RCCL AWS-CXI Plugin
+
+* RCCL relies on runtime plug-ins to connect to some transport layers
+    * Libfabric – the provider for Slingshot
+* A HIPified plugin adapted from the AWS OpenFabrics support is available: [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
+* It yields 3-4x faster collectives
+* The environment must be set up so that RCCL can find the plugin:
+
+```console
+module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
+module load aws-ofi-rccl/rocm-5.2.3
+# or
+export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofirccl:$LD_LIBRARY_PATH
+# (RCCL detects librccl-net.so on the library path)
+```
+
+* Verify that the plugin is detected:
+
+```console
+export NCCL_DEBUG=INFO
+export NCCL_DEBUG_SUBSYS=INIT
+# and search the logs for:
+# [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
+```
+
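+The speedup can be checked directly by timing a large all-reduce once with the plugin available and once without. Below is a rough, hypothetical micro-benchmark sketch under the same assumptions as the multi-node example above (`MASTER_ADDR`/`MASTER_PORT` exported, Slurm-launched job step); the file name and the 256 MiB payload are arbitrary choices, not part of the official scripts.
+
+```python
+# rccl_benchmark.py -- hypothetical micro-benchmark: run one job step with
+# the aws-ofi-rccl plugin available and one without, then compare timings.
+import os
+import time
+
+import torch
+import torch.distributed as dist
+
+rank = int(os.environ["SLURM_PROCID"])
+world_size = int(os.environ["SLURM_NTASKS"])
+dist.init_process_group("nccl", rank=rank, world_size=world_size)
+torch.cuda.set_device(int(os.environ["SLURM_LOCALID"]))
+
+# 64 Mi fp32 elements = 256 MiB per rank
+payload = torch.ones(64 * 1024 * 1024, device="cuda")
+
+# Warm up so communicator setup stays outside the timed region
+for _ in range(5):
+    dist.all_reduce(payload)
+torch.cuda.synchronize()
+
+start = time.perf_counter()
+for _ in range(20):
+    dist.all_reduce(payload)
+torch.cuda.synchronize()
+
+if rank == 0:
+    print(f"20 x 256 MiB all-reduce took {time.perf_counter() - start:.3f} s")
+dist.destroy_process_group()
+```
+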
+### amdgpu.ids Issue
+
+The `amdgpu.ids` issue in PyTorch builds is tracked upstream at:
+
+[https://github.com/pytorch/builder/issues/1410][4]
+
+## References
+
+* Samuel Antao (AMD), LUMI courses
+* [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/][5]
+* [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/][6]
+
+[1]: https://pytorch.org/
+[2]: https://github.com/pytorch/pytorch
+[3]: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/
+[4]: https://github.com/pytorch/builder/issues/1410
+[5]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
+[6]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/
+[7]: https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
diff --git a/docs.it4i/lumi/software.md b/docs.it4i/lumi/software.md
index bc414036e2a123a2599c049be2f3f21ee1e86759..ce11667840e101a3b1cf94cabd303f648dc32469 100644
--- a/docs.it4i/lumi/software.md
+++ b/docs.it4i/lumi/software.md
@@ -1,11 +1,19 @@
 # LUMI Software
 
-Below are the guides for selected [LUMI Software modules][1]:
+Below are links to LUMI guides for selected [LUMI Software modules][1]:
 
-## How to Run PyTorch on Lumi-G AMD GPU Accelerators
+## PyTorch
 
 [PyTorch][8] is an optimized tensor library for deep learning using GPUs and CPUs.
 
+### Comprehensive Guide on PyTorch
+
+See the [PyTorch][a] subsection for guides on how to install PyTorch and run interactive jobs.
+
+### How to Run PyTorch on Lumi-G AMD GPU Accelerators
+
+The official LUMI guide on running PyTorch on LUMI GPUs:
+
 [https://docs.lumi-supercomputer.eu/software/packages/pytorch/][2]
 
 ## How to Run Gromacs on Lumi-G AMD GPU Accelerators
@@ -53,3 +61,5 @@ Conda is an open-source, cross-platform,language-agnostic package manager and en
 [6]: https://docs.lumi-supercomputer.eu/software/local/csc/
 [7]: https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/
 [8]: https://pytorch.org/docs/stable/index.html
+
+[a]: pytorch.md
diff --git a/mkdocs.yml b/mkdocs.yml
index c33781664b100c4bbcd0385ed240fca6ae478125..9ed42908f3b81b34a856cc4630c1699418f64ec9 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -302,7 +302,9 @@ nav:
         - VESTA: software/viz/vesta.md
     - LUMI:
       - About LUMI: lumi/about.md
-      - LUMI Software: lumi/software.md
+      - LUMI Software:
+        - General: lumi/software.md
+        - PyTorch: lumi/pytorch.md
       - LUMI Support: lumi/support.md
     - Clouds:
       - e-INFRA CZ Cloud: cloud/einfracz-cloud.md