Commit ba81a12c authored by Jan Siwiec's avatar Jan Siwiec

Merge branch 'lumi-pytorch' into 'master'

Lumi pytorch

See merge request !459
parents 36268a56 62d769af
# PyTorch
## PyTorch Highlights
* Official page: [https://pytorch.org/][1]
* Code: [https://github.com/pytorch/pytorch][2]
* Python-based framework for machine learning
* Auto-differentiation on tensor types
* Official LUMI page: [https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/][3]
* **Warning:** Be careful where the SIF image is installed or copied (`$HOME` is not recommended, for quota reasons). With EasyBuild, you must specify the installation path: `eb PyTorch.eb -r --prefix=$PWD/easybuild`.
## PyTorch Install
### Base Environment
```console
module purge
module load CrayEnv
module load PrgEnv-cray/8.3.3
module load craype-accel-amd-gfx90a
module load cray-python
# Default ROCm – more recent versions are preferable (e.g. ROCm 5.6.0)
module load rocm/5.2.3.lua
```
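With the base environment loaded, a quick sanity check on a compute node that the GCDs are visible (`rocm-smi` is provided by the ROCm stack):
```console
rocm-smi
```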
### Scripts
* natively
    * [01-install-direct-torch1.13.1-rocm5.2.3.sh](scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
    * [01-install-direct-torch2.1.2-rocm5.5.3.sh](scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
* virtual env (a minimal sketch of this route follows the list)
    * [02-install-venv-torch1.13.1-rocm5.2.3.sh](scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
    * [02-install-venv-torch2.1.2-rocm5.5.3.sh](scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
* conda env
    * [03-install-conda-torch1.13.1-rocm5.2.3.sh](scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
    * [03-install-conda-torch2.1.2-rocm5.5.3.sh](scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
    * [05-install-container-torch2.0.1-rocm5.5.1.sh](scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
    * [05-install-container-torch2.1.0-rocm5.6.1.sh](scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
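For orientation, the virtual-env route boils down to creating a venv and installing ROCm wheels from the public PyTorch wheel index. A minimal sketch, not the exact script contents (assumes the base environment above is loaded; versions shown match the 1.13.1/ROCm 5.2 variant):
```console
python -m venv $PWD/pytorch-venv
source $PWD/pytorch-venv/bin/activate
pip install --upgrade pip
# ROCm wheels live on the dedicated PyTorch index:
pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 \
    --extra-index-url https://download.pytorch.org/whl/rocm5.2
```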
## PyTorch Tests
### Run Interactive Job on Single Node
```console
salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
```
### Scripts
* natively
    * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
    * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
* virtual env
    * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
    * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
* conda env
    * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
    * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
    * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
    * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
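At their core, these simple tests verify that PyTorch can see all eight GCDs of a standard-g node. A minimal hand-rolled equivalent (a sketch, assuming one of the installs above is active inside the allocation):
```console
srun -n 1 --gpus 8 python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
# Expected on a full standard-g node: True 8
```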
### Run Interactive Job on Multiple Nodes
```console
salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
```
### Scripts
* containers (singularity)
    * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
    * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
    * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
    * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
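For orientation, the distributed tests reduce to initializing a process group over RCCL and exchanging data between ranks. A minimal sketch (assumes a working install from above is active on all ranks; the `MASTER_ADDR` derivation is illustrative, not taken from the scripts):
```console
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
srun -N 2 -n 16 --gpus 16 python -c '
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])
world = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])
torch.cuda.set_device(local_rank)  # one GCD per rank
dist.init_process_group("nccl", rank=rank, world_size=world)  # "nccl" maps to RCCL on ROCm builds
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # every rank should print the world size
print(rank, x.item())
'
```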
## Tips
### Official Containers
```console
ls -la /appl/local/containers/easybuild-sif-images/
```
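One of these images can then be used along the following lines (the image name is a placeholder; `$WITH_CONDA` activates the conda environment shipped inside the image, as in the pip-install example further below):
```console
singularity exec /appl/local/containers/easybuild-sif-images/<image>.sif \
    bash -c '$WITH_CONDA; python -c "import torch; print(torch.__version__)"'
```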
### Unofficial Versions of ROCM
```console
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
ml rocm/5.4.3
ml rocm/5.6.0
```
### Unofficial Containers
```console
ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
```
### Installing Python Modules in Containers
```console
#!/bin/bash
wd=$(pwd)
SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
rm -rf $wd/setup-me.sh
cat > $wd/setup-me.sh << EOF
#!/bin/bash -e
\$WITH_CONDA
pip3 install scipy h5py tqdm
EOF
chmod +x $wd/setup-me.sh
mkdir -p $wd/pip_install
srun -n 1 --gpus 8 singularity exec \
-B /var/spool/slurmd:/var/spool/slurmd \
-B /opt/cray:/opt/cray \
-B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
-B $wd:/workdir \
-B $wd/pip_install:$HOME/.local/lib \
$SIF /workdir/setup-me.sh
# Add the path of pip_install to singularity-exec in run.sh:
# -B $wd/pip_install:$HOME/.local/lib \
```
### Controlling Device Visibility
* `HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
* `ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
* SLURM sets `ROCR_VISIBLE_DEVICES`
* The two differ in where visibility is enforced (the HIP runtime vs. the lower-level ROCr runtime), which has implications for how inter-device transfers are carried out – blit kernels and/or DMA
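A common way to combine this with Slurm (a sketch of the usual per-rank binding pattern, not taken from the scripts above) is a small wrapper that gives each local rank its own GCD:
```console
#!/bin/bash
# select-gpu-style wrapper: bind each rank to one GCD, then run the real command
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```
Invoked, for example, as `srun -n 8 --gpus 8 ./select_gpu.sh python my_script.py` (the wrapper and script names are placeholders).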
### RCCL
* The problem – on startup you may see:
* `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
* Checking error origin:
* `export NCCL_DEBUG=INFO`
* `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0> [3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
* `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
* The fix:
* `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
### RCCL AWS-CXI Plugin
* RCCL relies on runtime plug-ins to connect to some transport layers
* Libfabric – the provider for Slingshot
* A hipified plugin adapted from the AWS OpenFabrics (OFI) support is available
* [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
* 3-4x faster collectives
* The environment must point RCCL at the plugin:
```console
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
module load aws-ofi-rccl/rocm-5.2.3.lua
# Or
export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl
# (RCCL will detect librccl-net.so on this path)
```
* Verify the plugin is detected
```console
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT
# and search the logs for:
# [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
```
### amdgpu.ids Issue
PyTorch ROCm builds may fail to locate the `amdgpu.ids` file; the issue is tracked upstream:
[https://github.com/pytorch/builder/issues/1410][4]
## References
* Samuel Antao (AMD), LUMI Courses
* [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/][5]
* [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/][6]
[1]: https://pytorch.org/
[2]: https://github.com/pytorch/pytorch
[3]: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/
[4]: https://github.com/pytorch/builder/issues/1410
[5]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
[6]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/
[7]: https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
# LUMI Software
Below are links to LUMI guides for selected [LUMI Software modules][1]:
## PyTorch
[PyTorch][8] is an optimized tensor library for deep learning using GPUs and CPUs.
### Comprehensive Guide on PyTorch
See the [PyTorch][a] subsection for guides on how to install PyTorch and run interactive jobs.
### How to Run PyTorch on Lumi-G AMD GPU Accelerators
Link to the LUMI guide on how to run PyTorch on LUMI GPUs:
[https://docs.lumi-supercomputer.eu/software/packages/pytorch/][2]
## How to Run Gromacs on Lumi-G AMD GPU Accelerators
@@ -53,3 +61,5 @@ Conda is an open-source, cross-platform, language-agnostic package manager and en
[6]: https://docs.lumi-supercomputer.eu/software/local/csc/
[7]: https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/
[8]: https://pytorch.org/docs/stable/index.html
[a]: pytorch.md
@@ -302,7 +302,9 @@ nav:
  - VESTA: software/viz/vesta.md
- LUMI:
  - About LUMI: lumi/about.md
  - LUMI Software:
    - General: lumi/software.md
    - PyTorch: lumi/pytorch.md
  - LUMI Support: lumi/support.md
- Clouds:
  - e-INFRA CZ Cloud: cloud/einfracz-cloud.md