# PyTorch
## PyTorch Highlights
* Official page: [https://pytorch.org/][1]
* Code: [https://github.com/pytorch/pytorch][2]
* Python-based framework for machine learning
* Auto-differentiation on tensor types
* Official LUMI page: [https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/][3]
* **Warning:** Be careful where you install or copy the SIF image; `$HOME` is not recommended because of its quota. For EasyBuild, you must specify the installation path: `export EBU_USER_PREFIX=/project/project_XXXX/EasyBuild`.
## PyTorch Install
### Base Environment
```console
module purge
module load CrayEnv
module load PrgEnv-cray/8.3.3
module load craype-accel-amd-gfx90a
module load cray-python
# Default ROCm – more recent versions are preferable (e.g. ROCm 5.6.0)
module load rocm/5.2.3.lua
```
### Scripts
* natively
    * [01-install-direct-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
    * [01-install-direct-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
    * [02-install-venv-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
    * [02-install-venv-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
    * [03-install-conda-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
    * [03-install-conda-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
    * [05-install-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
    * [05-install-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
## PyTorch Tests
### Run Interactive Job on Single Node
```console
salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
```
### Scripts
* natively
    * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
    * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
    * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
    * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
    * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
    * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
    * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
    * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
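As an illustration of what a simple test can check, the sketch below reports the installed PyTorch version and the number of GPUs it sees. It is not the content of the linked scripts; the function name `gpu_report` is hypothetical, and the code assumes only that a ROCm build of PyTorch exposes its devices through the `torch.cuda` interface, as the one-liners under "Controlling Device Visibility" do.

```python
import importlib.util


def gpu_report():
    """Return the PyTorch version and GPU visibility of the current environment."""
    # Hypothetical helper: report cleanly when PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return {"torch": None, "gpu_available": False, "device_count": 0}
    import torch

    return {
        "torch": torch.__version__,
        "gpu_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(gpu_report())
```

With the single-node allocation above (`--gpus 8`), a working installation would be expected to report eight devices.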
### Run Interactive Job on Multiple Nodes
```console
salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
```
### Scripts
* containers (singularity)
    * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
    * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
    * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
    * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
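In a multi-node job, each task needs to know its global rank, the total number of tasks, and its rank within the node. The sketch below shows one way to derive these values from the standard Slurm variables `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` when tasks are started with `srun`. The names `RankInfo` and `rank_info_from_slurm` are hypothetical, and the linked scripts may obtain these values differently.

```python
import os
from typing import Mapping, NamedTuple


class RankInfo(NamedTuple):
    rank: int        # global rank of this task across all nodes
    world_size: int  # total number of tasks in the job step
    local_rank: int  # rank of this task within its node


def rank_info_from_slurm(env: Mapping[str, str] = os.environ) -> RankInfo:
    """Read rank information from Slurm variables, defaulting to a single task."""
    return RankInfo(
        rank=int(env.get("SLURM_PROCID", "0")),
        world_size=int(env.get("SLURM_NTASKS", "1")),
        local_rank=int(env.get("SLURM_LOCALID", "0")),
    )


if __name__ == "__main__":
    # Outside a Slurm job step this prints the single-task defaults.
    print(rank_info_from_slurm())
```

The `rank` and `world_size` values could then be passed to `torch.distributed.init_process_group`, and `local_rank` used to select the GPU of the task.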
## Tips
### Official Containers
```console
ls -la /appl/local/containers/easybuild-sif-images/
```
### Unofficial Versions of ROCm
```console
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
ml rocm/5.4.3
# or
ml rocm/5.6.0
```
### Unofficial Containers
```console
ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
```
### Installing Python Modules in Containers
```console
#!/bin/bash
# Working directory on the host; it is bind-mounted into the container as /workdir.
wd=$(pwd)
SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
# Generate the script that runs inside the container.
rm -f "$wd/setup-me.sh"
cat > "$wd/setup-me.sh" << EOF
#!/bin/bash -e
# The escaped variable is expanded inside the container, not on the host.
\$WITH_CONDA
pip3 install scipy h5py tqdm
EOF
chmod +x "$wd/setup-me.sh"
# Host directory bound over $HOME/.local/lib so that the installed packages persist.
mkdir -p "$wd/pip_install"
srun -n 1 --gpus 8 singularity exec \
    -B /var/spool/slurmd:/var/spool/slurmd \
    -B /opt/cray:/opt/cray \
    -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
    -B "$wd":/workdir \
    -B "$wd/pip_install":"$HOME/.local/lib" \
    "$SIF" /workdir/setup-me.sh
# To use the installed packages, add the same bind mount to singularity exec in run.sh:
# -B $wd/pip_install:$HOME/.local/lib \
```
### Controlling Device Visibility
* `HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
* `ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
* SLURM sets `ROCR_VISIBLE_DEVICES`
* Both ways of setting visibility have implications for blit kernels and/or DMA
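A common pattern is to give each task on a node its own GPU by setting one of these variables from the node-local rank. The sketch below does this from Python, assuming that tasks are started with `srun` (so that the standard Slurm variable `SLURM_LOCALID` is set) and that GPU index and local rank should coincide. The function name `pin_gpu_to_local_rank` is hypothetical.

```python
import os


def pin_gpu_to_local_rank(env=None, variable="ROCR_VISIBLE_DEVICES"):
    """Restrict this process to the GPU whose index equals its node-local rank.

    Returns the chosen device index as a string, or None outside a Slurm job step.
    """
    env = os.environ if env is None else env
    local_id = env.get("SLURM_LOCALID")
    if local_id is None:
        return None  # not started by srun: leave visibility untouched
    env[variable] = local_id
    return local_id


if __name__ == "__main__":
    # Call this before the GPU runtime is initialized; to be safe, before `import torch`.
    print(pin_gpu_to_local_rank())
```

Passing `variable="HIP_VISIBLE_DEVICES"` applies the same selection through the other mechanism.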
### RCCL
* The problem: the following error appears on startup
    * `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
* Setting `export NCCL_DEBUG=INFO` shows where the error originates
    * `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
    * `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
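The `NET/Socket` line above lists `nmn0` next to the four `hsn` interfaces. The sketch below parses such a line and separates the two groups. It rests on assumptions this page does not state: that the `hsn*` interfaces are the high-speed ones RCCL should use, that `nmn0` is not, and that restricting RCCL through the NCCL variable `NCCL_SOCKET_IFNAME` is a suitable remedy. The function names are hypothetical.

```python
import re


def socket_interfaces(log_line):
    """Return the interface names listed in an 'NCCL INFO NET/Socket : Using ...' line."""
    return re.findall(r"\[\d+\](\w+):", log_line)


def split_by_prefix(interfaces, prefix="hsn"):
    """Split interface names into those that start with `prefix` and the rest."""
    wanted = [name for name in interfaces if name.startswith(prefix)]
    other = [name for name in interfaces if not name.startswith(prefix)]
    return wanted, other


if __name__ == "__main__":
    line = (
        "NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> "
        "[1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>"
        "[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>"
    )
    wanted, other = split_by_prefix(socket_interfaces(line))
    print("unexpected interfaces:", other)
    # Assumed remedy: export this value before launching the job.
    print("NCCL_SOCKET_IFNAME=" + ",".join(wanted))
```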
### RCCL AWS-CXI Plugin
* RCCL relies on runtime plug-ins to connect with some transport layers
* A hipified plugin, adapted from the AWS OpenFabrics support, is available: [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
* 3-4x faster collectives
* The loading environment must point to the plugin:
```console
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
module load aws-ofi-rccl/rocm-5.2.3.lua
# Or add the plugin directory to the library search path
# (librccl-net.so will then be detected):
export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofirccl:$LD_LIBRARY_PATH
```
* Verify the plugin is detected
```console
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT
# and search the logs for:
# [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
```
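Searching the logs can be automated. The sketch below scans log lines for the message shown above and returns the reported plugin version; the function name `plugin_version` is hypothetical, and the pattern assumes the message format stays as in the example line.

```python
import re

# Matches e.g. "[0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0"
PLUGIN_PATTERN = re.compile(r"NCCL INFO NET/OFI Using aws-ofi-rccl (\S+)")


def plugin_version(log_lines):
    """Return the aws-ofi-rccl version found in the log lines, or None if absent."""
    for line in log_lines:
        match = PLUGIN_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    sample = ["[0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0"]
    print(plugin_version(sample))
```

A return value of `None` means the message was not found in the given lines, which suggests the plugin was not picked up.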
### amdgpu.ids Issue
[https://github.com/pytorch/builder/issues/1410][4]
## References
* Samuel Antao (AMD), LUMI Courses
    * [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/][5]
    * [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/][6]
[1]: https://pytorch.org/
[2]: https://github.com/pytorch/pytorch
[3]: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/
[4]: https://github.com/pytorch/builder/issues/1410
[5]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
[6]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/
[7]: https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl