diff --git a/docs.it4i/lumi/pytorch.md b/docs.it4i/lumi/pytorch.md
index fa3e54afde555a7d886a2890ac7e29cbf5a87def..fad4681e43e0089aeee1e5c25657e0742ac7bfd2 100644
--- a/docs.it4i/lumi/pytorch.md
+++ b/docs.it4i/lumi/pytorch.md
@@ -27,18 +27,18 @@ module load rocm/5.2.3.lua
 ### Scripts

 * natively
- * [01-install-direct-torch1.13.1-rocm5.2.3.sh](scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
- * [01-install-direct-torch2.1.2-rocm5.5.3.sh](scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
+ * [01-install-direct-torch1.13.1-rocm5.2.3.sh](scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
+ * [01-install-direct-torch2.1.2-rocm5.5.3.sh](scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
 * virtual env
- * [02-install-venv-torch1.13.1-rocm5.2.3.sh](scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
- * [02-install-venv-torch2.1.2-rocm5.5.3.sh](scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
+ * [02-install-venv-torch1.13.1-rocm5.2.3.sh](scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
+ * [02-install-venv-torch2.1.2-rocm5.5.3.sh](scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
 * conda env
- * [03-install-conda-torch1.13.1-rocm5.2.3.sh](scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
- * [03-install-conda-torch2.1.2-rocm5.5.3.sh](scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
- * from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
+ * [03-install-conda-torch1.13.1-rocm5.2.3.sh](scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
+ * [03-install-conda-torch2.1.2-rocm5.5.3.sh](scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
+ * from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
 * containers (singularity)
- * [05-install-container-torch2.0.1-rocm5.5.1.sh](scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
- * [05-install-container-torch2.1.0-rocm5.6.1.sh](scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
+ * [05-install-container-torch2.0.1-rocm5.5.1.sh](scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
+ * [05-install-container-torch2.1.0-rocm5.6.1.sh](scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)

 ## PyTorch Tests

@@ -51,18 +51,18 @@ salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
 ### Scripts

 * natively
- * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
- * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
+ * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
+ * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
 * virtual env
- * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
- * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
+ * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
+ * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
 * conda env
- * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
- * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
- * from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
+ * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
+ * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
+ * from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
 * containers (singularity)
- * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
- * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
+ * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
+ * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)

 ### Run Interactive Job on Multiple Nodes

@@ -73,10 +73,10 @@ salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
 ### Scripts

 * containers (singularity)
- * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
- * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
- * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
- * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
+ * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
+ * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
+ * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
+ * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)

 ## Tips

@@ -141,18 +141,18 @@ $SIF /workdir/setup-me.sh
 ### RCCL

 * The problem – on startup we can see:
- * `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
+ * `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
 * Checking error origin:
- * `export NCCL_DEBUG=INFO`
- * `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
- * `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
+ * `export NCCL_DEBUG=INFO`
+ * `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
+ * `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
 * The fix:
- * `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
+ * `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`

 ### RCCL AWS-CXI Plugin

 * RCCL relies on runtime plugin-ins to connect with some transport layers
- * Libfabric – provider for Slingshot
+ * Libfabric – provider for Slingshot
 * Hipified plugin adapted from AWS OpenFabrics support available
 * [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
 * 3-4x faster collectives
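
The RCCL fix listed in the last hunk could be wrapped in a small helper that a job script sources before launching PyTorch. This is a minimal sketch, not part of the scripts touched by this diff: the function name `configure_rccl_sockets` and its two positional arguments are hypothetical, while the interface names and the `NCCL_DEBUG` / `NCCL_SOCKET_IFNAME` variables are the ones shown in the RCCL section.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name and arguments are assumptions for illustration).
# $1: comma-separated interface list; defaults to the Slingshot interfaces
#     hsn0..hsn3 reported in the NCCL INFO output of the RCCL section.
# $2: pass "debug" to also enable the NCCL_DEBUG=INFO diagnostics.
configure_rccl_sockets() {
    # Keep RCCL off the nmn0 interface by listing only the hsn* interfaces.
    export NCCL_SOCKET_IFNAME="${1:-hsn0,hsn1,hsn2,hsn3}"

    if [ "${2:-}" = "debug" ]; then
        # Same setting the doc uses to find the origin of the error.
        export NCCL_DEBUG=INFO
    fi
}

# Apply the defaults and show what the launched processes will inherit.
configure_rccl_sockets
echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
```

Calling `configure_rccl_sockets "hsn0,hsn1" debug` would instead limit traffic to two interfaces and turn on the diagnostic output while investigating a startup failure.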