Commit 0fe94b6c authored by Jan Siwiec

Update pytorch.md

parent ba81a12c
Pipeline #36467 failed
@@ -27,18 +27,18 @@ module load rocm/5.2.3.lua
### Scripts
* natively
  * [01-install-direct-torch1.13.1-rocm5.2.3.sh](scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
  * [01-install-direct-torch2.1.2-rocm5.5.3.sh](scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
* virtual env
  * [02-install-venv-torch1.13.1-rocm5.2.3.sh](scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
  * [02-install-venv-torch2.1.2-rocm5.5.3.sh](scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
* conda env
  * [03-install-conda-torch1.13.1-rocm5.2.3.sh](scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
  * [03-install-conda-torch2.1.2-rocm5.5.3.sh](scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
  * [05-install-container-torch2.0.1-rocm5.5.1.sh](scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
  * [05-install-container-torch2.1.0-rocm5.6.1.sh](scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
## PyTorch Tests
@@ -51,18 +51,18 @@ salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
### Scripts
* natively
  * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
  * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
* virtual env
  * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
  * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
* conda env
  * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
  * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
  * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
  * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
### Run Interactive Job on Multiple Nodes
@@ -73,10 +73,10 @@ salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
### Scripts
* containers (singularity)
  * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
  * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
  * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
  * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
## Tips
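In a multi-node allocation such as `-N 2 -n 16 --gpus 16`, each task typically needs to know its global rank, the total number of tasks, and its node-local rank in order to pick a GPU. The linked scripts are not shown here, so the following is only a minimal sketch of one common approach: reading the `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` variables that Slurm exports for each task. The helper name `slurm_rank_info` and the example values are hypothetical.

```python
import os


def slurm_rank_info(env=os.environ):
    """Collect per-task rank information from Slurm's environment variables.

    Falls back to single-process defaults when a variable is missing,
    e.g. when the script is run outside of a Slurm job step.
    """
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),        # global task index
        "world_size": int(env.get("SLURM_NTASKS", 1)),  # total number of tasks
        "local_rank": int(env.get("SLURM_LOCALID", 0)), # task index within the node
    }


# Hypothetical values for one task of a 16-task job (illustration only).
info = slurm_rank_info(
    {"SLURM_PROCID": "9", "SLURM_NTASKS": "16", "SLURM_LOCALID": "1"}
)
print(info)  # {'rank': 9, 'world_size': 16, 'local_rank': 1}
```

In a PyTorch job, `local_rank` would usually be passed to `torch.cuda.set_device`, and `rank` and `world_size` to `torch.distributed.init_process_group`. Whether the scripts above do exactly this is an assumption.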
@@ -141,18 +141,18 @@ $SIF /workdir/setup-me.sh
### RCCL
* The problem – on startup we can see:
  * `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
* Checking error origin:
  * `export NCCL_DEBUG=INFO`
  * `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
  * `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
* The fix:
  * `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
### RCCL AWS-CXI Plugin
* RCCL relies on runtime plugins to connect with some transport layers
  * Libfabric – provider for Slingshot
* Hipified plugin adapted from AWS OpenFabrics support available
  * [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
* 3-4x faster collectives
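The fix works by restricting RCCL to the `hsn*` interfaces and leaving out `nmn0`, which the `NET/Socket` debug line also lists. As a rough illustration, the interface list can be derived from that debug line instead of being typed by hand. This is a sketch only: the helper name `high_speed_interfaces` is hypothetical, and it assumes that every interface whose name starts with `hsn` should be used.

```python
import re


def high_speed_interfaces(log_line, prefix="hsn"):
    """Return interface names from an NCCL NET/Socket line that start with prefix."""
    # Entries look like "[1]hsn0:10.253.6.67<0>"; capture the name before the colon.
    names = re.findall(r"\[\d+\](\w+):", log_line)
    return [name for name in names if name.startswith(prefix)]


# The debug line quoted above, split across string literals for readability.
log_line = (
    "NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> "
    "[1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>"
    "[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>"
)

ifname = ",".join(high_speed_interfaces(log_line))
print(f"export NCCL_SOCKET_IFNAME={ifname}")
# export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
```

The variable has to be set before the process group is initialized, for example in the job script or via `os.environ` at the top of the training script.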
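Because the plugin is loaded at runtime, a quick sanity check is to confirm that the plugin library can be found on the library search path before launching a job. The sketch below rests on two assumptions that should be verified against the plugin's own documentation: that the built library is named `librccl-net.so`, and that it is located through directories such as those in `LD_LIBRARY_PATH`. The helper name `find_plugin` is hypothetical.

```python
import os
import tempfile


def find_plugin(search_path, filename="librccl-net.so"):
    """Return the first path in search_path that contains filename, else None.

    search_path uses the platform path separator, like LD_LIBRARY_PATH.
    The default filename is an assumption about the plugin's library name.
    """
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Self-contained demonstration with a temporary directory standing in
# for the real plugin installation directory.
with tempfile.TemporaryDirectory() as plugin_dir:
    open(os.path.join(plugin_dir, "librccl-net.so"), "w").close()
    found = find_plugin(os.pathsep.join(["/nonexistent", plugin_dir]))
    print(found is not None)  # True
```

On a real system the call would be `find_plugin(os.environ.get("LD_LIBRARY_PATH", ""))`. A `None` result suggests that RCCL may not find the plugin and may fall back to a slower transport.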