    # PyTorch
    
    ## PyTorch Highlight
    
    * Official page: [https://pytorch.org/][1]
    * Code: [https://github.com/pytorch/pytorch][2]
    * Python-based framework for machine learning
      * Auto-differentiation on tensor types
    * Official LUMI page: [https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/][3]
    
      * **Warning:** Choose carefully where the SIF image is installed or copied; `$HOME` is not recommended because of its quota. For EasyBuild, you must specify the installation path: `export EBU_USER_PREFIX=/project/project_XXXX/EasyBuild`.
    
    
    ## PyTorch Install
    
    ### Base Environment
    
    ```console
    module purge
    module load CrayEnv
    module load PrgEnv-cray/8.3.3
    module load craype-accel-amd-gfx90a
    module load cray-python
    
    # Default ROCm – more recent versions are preferable (e.g. ROCm 5.6.0)
    module load rocm/5.2.3.lua
    ```
    
    ### Scripts
    
    * natively
    
      * [01-install-direct-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
      * [01-install-direct-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
    
    * virtual env
    
      * [02-install-venv-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
      * [02-install-venv-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
    
    * conda env
    
      * [03-install-conda-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
      * [03-install-conda-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
      * from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
    
    * containers (singularity)
    
      * [05-install-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
      * [05-install-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
    
    
    ## PyTorch Tests
    
    ### Run Interactive Job on Single Node
    
    ```console
    salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
    ```
    
    ### Scripts
    
    * natively
    
      * [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
      * [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
    
    * virtual env
    
      * [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
      * [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
    
    * conda env
    
      * [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
      * [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
      * from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
    
    * containers (singularity)
    
      * [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
      * [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
    
    
    ### Run Interactive Job on Multiple Nodes
    
    ```console
    salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
    ```
    
    ### Scripts
    
    * containers (singularity)
    
      * [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
      * [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
      * [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
      * [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
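
In a multi-node run such as the allocation above, each task usually needs its global rank, its local rank on the node, and the total number of tasks. The following is a minimal sketch, assuming the tasks are started with `srun` so that Slurm exports `SLURM_PROCID`, `SLURM_LOCALID`, and `SLURM_NTASKS`. The helper name `slurm_rank_info` is hypothetical and is not taken from the linked scripts.

```python
import os


def slurm_rank_info(env=None):
    """Read rank information from Slurm's per-task environment variables.

    Falls back to a single-process layout when the variables are missing,
    for example when the script is started outside of srun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),
        "local_rank": int(env.get("SLURM_LOCALID", 0)),
        "world_size": int(env.get("SLURM_NTASKS", 1)),
    }


if __name__ == "__main__":
    info = slurm_rank_info()
    print(f"rank {info['rank']} of {info['world_size']}, local rank {info['local_rank']}")
```

With `-N 2 -n 16`, and assuming Slurm places eight tasks on each node, the local rank runs from 0 to 7 on each node; it is commonly the value passed to `torch.cuda.set_device`.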
    
    
    ## Tips
    
    ### Official Containers
    
    ```console
    ls -la /appl/local/containers/easybuild-sif-images/
    ```
    
    ### Unofficial Versions of ROCm
    
    ```console
    module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
    # Load one of the following versions:
    ml rocm/5.4.3
    ml rocm/5.6.0
    ```
    
    ### Unofficial Containers
    
    ```console
    ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
    ```
    
    ### Installing Python Modules in Containers
    
    ```console
    #!/bin/bash
    
    # Working directory and container image
    wd=$(pwd)
    SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
    
    # Generate the helper script that runs inside the container
    rm -f "$wd/setup-me.sh"
    cat > "$wd/setup-me.sh" << EOF
    #!/bin/bash -e
    
    # Activate the conda environment of the container, then install the packages
    \$WITH_CONDA
    pip3 install scipy h5py tqdm
    EOF
    chmod +x "$wd/setup-me.sh"
    
    # Host directory bound over $HOME/.local/lib so the installed packages persist
    mkdir -p "$wd/pip_install"
    
    srun -n 1 --gpus 8 singularity exec \
    -B /var/spool/slurmd:/var/spool/slurmd \
    -B /opt/cray:/opt/cray \
    -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
    -B "$wd":/workdir \
    -B "$wd/pip_install":"$HOME/.local/lib" \
    "$SIF" /workdir/setup-me.sh
    
    # To use the installed packages later, add the same bind mount
    # to the singularity exec call in run.sh:
    # -B $wd/pip_install:$HOME/.local/lib \
    ```
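
After the installation, it can be useful to confirm from inside the container that the packages are importable. The following is a minimal sketch using only the Python standard library; the helper name `missing_modules` is hypothetical, and the package list simply mirrors the `pip3 install` line in the script above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Packages installed by setup-me.sh in the script above.
    missing = missing_modules(["scipy", "h5py", "tqdm"])
    print("all packages importable" if not missing else f"missing: {missing}")
```

If a package is reported as missing, a likely cause is that the `pip_install` bind mount was not added to the `singularity exec` call that runs the check.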
    
    ### Controlling Device Visibility
    
    * `HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
    * `ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
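    * Slurm sets `ROCR_VISIBLE_DEVICES` for the job.
    * The two variables restrict visibility at different layers (`HIP_VISIBLE_DEVICES` in the HIP runtime, `ROCR_VISIBLE_DEVICES` in the underlying ROCr runtime), and the choice has implications for whether copies use blit kernels and/or DMA.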
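
To see which of the two variables is in effect inside a job step, the settings can be printed next to the device count that PyTorch reports. The following is a minimal sketch; the helper name `visible_devices` is hypothetical, and the `torch` import is optional so that the script also runs where PyTorch is not installed.

```python
import os


def visible_devices(env=None):
    """Return the device list of each visibility variable, or None if it is unset."""
    env = os.environ if env is None else env
    result = {}
    for name in ("ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES"):
        raw = env.get(name)
        if raw is None:
            result[name] = None
        else:
            result[name] = [d.strip() for d in raw.split(",") if d.strip()]
    return result


if __name__ == "__main__":
    for name, devices in visible_devices().items():
        print(f"{name}: {'unset' if devices is None else devices}")
    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
    else:
        print("torch.cuda.device_count():", torch.cuda.device_count())
```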
    
    ### RCCL
    
    * The problem: on startup, the following error can appear:
    
      * `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
    
    * Find the origin of the error by enabling debug output and inspecting the log:
    
      * `export NCCL_DEBUG=INFO`
      * `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
      * `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
    
    * The fix: restrict RCCL to the high-speed network interfaces (`hsn*`) so that it does not pick `nmn0`:
    
      * `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
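
The same setting can also be applied from within the training script, as long as it happens before the process group is created. The following is a minimal sketch; the helper name `configure_rccl_interfaces` is hypothetical, and the interface names are the ones from the fix above.

```python
import os


def configure_rccl_interfaces(env=None, interfaces=("hsn0", "hsn1", "hsn2", "hsn3")):
    """Set NCCL_SOCKET_IFNAME unless the launching environment already set it."""
    env = os.environ if env is None else env
    env.setdefault("NCCL_SOCKET_IFNAME", ",".join(interfaces))
    return env["NCCL_SOCKET_IFNAME"]


if __name__ == "__main__":
    print("NCCL_SOCKET_IFNAME =", configure_rccl_interfaces())
```

In a distributed script, the call would go before `torch.distributed.init_process_group(backend="nccl")`, because the variable has to be in the environment when the communication library initializes. Using `setdefault` keeps any value already exported by the job script.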
    
    
    ### RCCL AWS-CXI Plugin
    
    * RCCL relies on runtime plug-ins to connect with some transport layers
    
      * Libfabric – provider for Slingshot
    
    * A hipified plugin, adapted from the AWS OpenFabrics support, is available:
    * [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
    * Collectives are 3-4x faster with the plugin
    * The loading environment must point to the plugin:
    
    ```console
    module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
    module load aws-ofi-rccl/rocm-5.2.3.lua
    # Or prepend the plugin directory to the library search path
    export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofirccl:$LD_LIBRARY_PATH
    # (RCCL will then detect librccl-net.so)
    ```
    
    * Verify that the plugin is detected:
    
    ```console
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=INIT
    # and search the logs for:
    # [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
    ```
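
Searching the log can be automated. The following is a minimal sketch that looks for the line shown above in the text of a job log; the helper name `detected_plugin_version` is hypothetical, and the pattern assumes the log line has exactly the form quoted above.

```python
import re

# Matches the line RCCL prints when the aws-ofi-rccl plugin is in use.
_PLUGIN_LINE = re.compile(r"NCCL INFO NET/OFI Using aws-ofi-rccl (\S+)")


def detected_plugin_version(log_text):
    """Return the aws-ofi-rccl version reported in the log, or None if absent."""
    match = _PLUGIN_LINE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "[0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0"
    print(detected_plugin_version(sample))
```

In practice the function would be given the contents of the job's output file; a return value of `None` suggests that the plugin was not loaded.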
    
    ### amdgpu.ids Issue
    
    [https://github.com/pytorch/builder/issues/1410][4]
    
    ## References
    
    * Samuel Antao (AMD), LUMI Courses
    * [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/][5]
    * [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/][6]
    
    [1]: https://pytorch.org/
    [2]: https://github.com/pytorch/pytorch
    [3]: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/
    [4]: https://github.com/pytorch/builder/issues/1410
    [5]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
    [6]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/
    [7]: https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl