* `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
* Checking error origin:
* `export NCCL_DEBUG=INFO`
* `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0> [3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
* `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
* The fix:
* `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
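
The fix hard-codes the four high-speed interface names. As a minimal sketch, the same list could be built from whatever interfaces a node exposes, assuming the high-speed interfaces all carry the `hsn` prefix seen in the log and that interface names can be read from `/sys/class/net` on a Linux node. The helper name `hsn_ifnames` is hypothetical and not part of NCCL or RCCL.

```shell
# Hypothetical helper: keep only the names that start with "hsn" and
# join them with commas, in the order given. The ( ... ) body runs in a
# subshell so the loop variables do not leak into the caller.
hsn_ifnames() (
    list=""
    for name in "$@"; do
        case "$name" in
            hsn*) list="${list:+$list,}$name" ;;
        esac
    done
    printf '%s\n' "$list"
)

# On Linux, each network interface appears as an entry in /sys/class/net.
# Word splitting of the ls output is intentional here.
if [ -d /sys/class/net ]; then
    ifnames=$(hsn_ifnames $(ls /sys/class/net))
    # Only export when at least one hsn interface was found, so that the
    # management interface (nmn0 in the log) is never selected by accident.
    if [ -n "$ifnames" ]; then
        export NCCL_SOCKET_IFNAME="$ifnames"
    fi
fi
```

Calling `hsn_ifnames nmn0 hsn0 hsn1 hsn2 hsn3` prints `hsn0,hsn1,hsn2,hsn3`, which matches the value exported in the fix.
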
### RCCL AWS-CXI Plugin
* RCCL relies on runtime plugins to connect with some transport layers
* Libfabric – provider for Slingshot
* A hipified plugin, adapted from the AWS OpenFabrics support, is available
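
Because the plugin is loaded at runtime, it has to be visible to the dynamic loader when the job starts. The sketch below illustrates one way to check for it, assuming the plugin is built as a shared library named `librccl-net.so` (the name used by the aws-ofi-rccl project) and that RCCL picks it up through `LD_LIBRARY_PATH`. The helper `add_rccl_plugin_dir` and the `PLUGIN_DIR` variable are hypothetical placeholders, not part of RCCL.

```shell
# Hypothetical helper: prepend DIR to LD_LIBRARY_PATH only when it
# actually contains the plugin library (assumed name: librccl-net.so).
add_rccl_plugin_dir() {
    dir=$1
    if [ -f "$dir/librccl-net.so" ]; then
        export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        return 0
    fi
    echo "no librccl-net.so in $dir; RCCL may fall back to built-in transports" >&2
    return 1
}

# PLUGIN_DIR is a placeholder: set it to wherever the plugin was installed.
if [ -n "${PLUGIN_DIR:-}" ]; then
    # Turn on NCCL_DEBUG=INFO so the log shows which network backend is used.
    add_rccl_plugin_dir "$PLUGIN_DIR" && export NCCL_DEBUG=INFO
fi
```

With `NCCL_DEBUG=INFO` set, the log is expected to mention `NET/OFI` rather than `NET/Socket` when the plugin has been picked up.
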