diff --git a/README.md b/README.md
index 1ab9e612d76a67881d9e663ae5ffcf9078a04505..0b91e9d52cf654c7987aa755d69571f68d2a409d 100644
--- a/README.md
+++ b/README.md
@@ -57,6 +57,58 @@ module load rocm/5.2.3.lua
 - containers (singularity)
   - [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
 
+# Tips and tricks
+## Unofficial versions of ROCm
+```
+module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
+ml rocm/5.4.3
+ml rocm/5.6.0
+```
+
+## Unofficial containers
+```
+ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
+```
+
+## Controlling device visibility
+- `HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
+- `ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
+- SLURM sets `ROCR_VISIBLE_DEVICES`
+- Implications of both ways of setting visibility – blit kernels and/or DMA
+
+## RCCL
+- The problem – on startup we can see:
+  - `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
+- Checking error origin:
+  - `export NCCL_DEBUG=INFO`
+  - `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
+  - `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
+- The fix:
+  - `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
+
+## RCCL AWS-CXI plugin
+- RCCL relies on runtime plugins to connect with some transport layers
+  - Libfabric – provider for Slingshot
+- A hipified plugin adapted from the AWS OpenFabrics support is available:
+- https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
+- 3-4x faster collectives
+- The loading environment must point to the plugin
+```
+module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
+module load aws-ofi-rccl/rocm-5.2.3.lua
+# Or
+export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofirccl
+# (will detect librccl-net.so)
+```
+- Verify that the plugin is detected
+```
+export NCCL_DEBUG=INFO
+export NCCL_DEBUG_SUBSYS=INIT
+# and search the logs for:
+# [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
+```
 
 # References
-- Samuel Antao (AMD), Comprehensive General LUMI Course - October 3-6th
\ No newline at end of file
+- Samuel Antao (AMD), LUMI Courses
+- https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
+- https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/
\ No newline at end of file