# Complementary System Job Scheduling
    
    ## Introduction
    
The [Slurm][1] workload manager is used to allocate and access Complementary systems resources.
    
    
    ## Getting Partition Information
    
    
Display partitions/queues:
    
    
```console
$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
p00-arm      up 1-00:00:00          0/1/0/1 p00-arm01
p01-arm*     up 1-00:00:00          0/8/0/8 p01-arm[01-08]
p02-intel    up 1-00:00:00          0/2/0/2 p02-intel[01-02]
p03-amd      up 1-00:00:00          0/2/0/2 p03-amd[01-02]
p04-edge     up 1-00:00:00          0/1/0/1 p04-edge01
p05-synt     up 1-00:00:00          0/1/0/1 p05-synt01
p06-arm      up 1-00:00:00          0/2/0/2 p06-arm[01-02]
p07-power    up 1-00:00:00          0/1/0/1 p07-power01
p08-amd      up 1-00:00:00          0/1/0/1 p08-amd01
p10-intel    up 1-00:00:00          0/1/0/1 p10-intel01
```
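
To inspect a single partition, it can be named explicitly; a usage sketch:

```console
$ sinfo -p p01-arm
```
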
    ## Getting Job Information
    
    
Show jobs:

```console
$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               104   p01-arm interact    user   R       1:48      2 p01-arm[01-02]
```
    
    
Show job details for a specific job:

```console
$ scontrol -d show job JOBID
```

Show job details for the executing job from within the job session:

```console
$ scontrol -d show job $SLURM_JOBID
```
    
    
    ## Running Interactive Jobs
    
Run an interactive job:

```console
$ salloc -A PROJECT-ID -p p01-arm
```
    
Run an interactive job with X11 forwarding:

```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```
    
    !!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
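
Within the granted allocation, launch parallel tasks with `srun`; a minimal usage sketch, with `hostname` as a placeholder command:

```console
$ salloc -A PROJECT-ID -p p01-arm -N 2
$ srun hostname
```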
    
    ## Running Batch Jobs
    
Run a batch job:

```console
$ sbatch -A PROJECT-ID -p p01-arm ./script.sh
```
    
Useful command options for `salloc`, `sbatch`, and `srun` (see the example batch script below):
    
    * -n, --ntasks
    * -c, --cpus-per-task
    * -N, --nodes
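
A minimal sketch of a batch script using these options; the partition, node/task counts, time limit, and `./my_app` executable are placeholder values, not prescriptions:

```bash
#!/usr/bin/env bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p01-arm
#SBATCH --nodes=2             # -N
#SBATCH --ntasks-per-node=48  # one task per A64FX core
#SBATCH --cpus-per-task=1     # -c
#SBATCH --time=02:00:00

# Launch one task per allocated core on the allocated nodes.
srun ./my_app
```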
    
    
    ## Slurm Job Environment Variables
    
Slurm provides useful information to the job via environment variables. These variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
    
See all Slurm variables:

```
$ set | grep ^SLURM
```

    ### Useful Variables
    
| Variable name | Description | Example |
| ------ | ------ | ------ |
| SLURM_JOB_ID | job ID of the executing job | 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | p03-amd |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
    
    
    See [Slurm srun documentation][2] for details.
    
Get the job nodelist:

```
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
```
    
Expand the nodelist to a list of nodes:

```
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
```
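
For tools that expect an explicit host list, the expanded nodelist can be redirected to a file; a minimal sketch (the `hostfile.txt` name is arbitrary):

```
$ scontrol show hostnames $SLURM_JOB_NODELIST > hostfile.txt
```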
    
    
    ## Modifying Jobs
    
    ```
    $ scontrol update JobId=JOBID ATTR=VALUE
    ```
    
For example:
    
    ```
    $ scontrol update JobId=JOBID Comment='The best job ever'
    ```
    
    ## Deleting Jobs
    
    ```
    $ scancel JOBID
    ```
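
To cancel all of your own jobs at once (assuming the installed Slurm version supports `--me`, as the `squeue --me` example above suggests):

```
$ scancel --me
```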
    
    
    ## Partitions
    
    
| PARTITION | nodes | whole node | cores per node | features |
| --------- | ----- | ---------- | -------------- | -------- |
| p00-arm   | 1     | yes        | 64             | aarch64,cortex-a72 |
| p01-arm   | 8     | yes        | 48             | aarch64,a64fx,ib |
| p02-intel | 2     | no         | 64             | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
| p03-amd   | 2     | no         | 64             | x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx |
| p04-edge  | 1     | yes        | 16             | x86_64,intel,broadwell,ib |
| p05-synt  | 1     | yes        | 8              | x86_64,amd,milan,ib,ht |
| p06-arm   | 2     | yes        | 80             | aarch64,ib |
| p07-power | 1     | yes        | 192            | ppc64le,ib |
| p08-amd   | 1     | yes        | 128            | x86_64,amd,milan-x,ib,ht |
| p10-intel | 1     | yes        | 96             | x86_64,intel,sapphire_rapids,ht |
    
Use the `-t`, `--time` option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.

    FIFO scheduling with backfilling is employed.
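
For example, a 12-hour interactive allocation might be requested as follows (a usage sketch):

```console
$ salloc -A PROJECT-ID -p p01-arm -t 12:00:00
```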
    
    ## Partition 00 - ARM (Cortex-A72)
    
    
    Whole node allocation.
    
    One node:
    
```console
salloc -A PROJECT-ID -p p00-arm
```
    
    ## Partition 01 - ARM (A64FX)
    
    Whole node allocation.
    
    One node:
    
```console
salloc -A PROJECT-ID -p p01-arm
salloc -A PROJECT-ID -p p01-arm -N 1
```
    
    Multiple nodes:
    
```console
salloc -A PROJECT-ID -p p01-arm -N 8
```
    
    ## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
    
    
FPGAs are treated as resources. See below for more details about resources.

Partial allocation - per FPGA, resource separation is not enforced.
Use only FPGAs allocated to the job!

One FPGA:

```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga
```
    
    Two FPGAs on the same node:
    
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
```

All FPGAs:

```console
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
```
    
    ## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
    
    
GPUs and FPGAs are treated as resources. See below for more details about resources.

Partial allocation - per GPU and per FPGA, resource separation is not enforced.
Use only GPUs and FPGAs allocated to the job!

One GPU:

```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu
```
    
    Two GPUs on the same node:
    
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:2
```
    
    Four GPUs on the same node:
    
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4
```

All GPUs:

```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4
```

One FPGA:

```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga
```

Two FPGAs on the same node:

```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
```

All FPGAs:

```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
```
    
    One GPU and one FPGA on the same node:
    
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu,fpga
```
    
    Four GPUs and two FPGAs on the same node:
    
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2
```
    
    All GPUs and FPGAs:
    
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2
```
    
    ## Partition 04 - Edge Server
    
    Whole node allocation:
    
```console
salloc -A PROJECT-ID -p p04-edge
```
    
    ## Partition 05 - FPGA Synthesis Server
    
    Whole node allocation:
    
```console
salloc -A PROJECT-ID -p p05-synt
```

    ## Partition 06 - ARM
    
    Whole node allocation:
    
    ```console
    salloc -A PROJECT-ID -p p06-arm
    ```
    
    ## Partition 07 - IBM Power
    
    Whole node allocation:
    
    ```console
    salloc -A PROJECT-ID -p p07-power
    ```
    
    ## Partition 08 - AMD Milan-X
    
    Whole node allocation:
    
    ```console
    salloc -A PROJECT-ID -p p08-amd
    ```
    
    
    ## Partition 10 - Intel Sapphire Rapids
    
    Whole node allocation:
    
    ```console
salloc -A PROJECT-ID -p p10-intel
```

    ## Features
    
    Nodes have feature tags assigned to them.
    
Users can select nodes based on the feature tags using the `--constraint` option.
    
| Feature | Description |
| ------ | ------ |
| aarch64 | platform |
| x86_64 | platform |
| ppc64le | platform |
| amd | manufacturer |
| intel | manufacturer |
| icelake | processor family |
| broadwell | processor family |
| sapphire_rapids | processor family |
| milan | processor family |
| milan-x | processor family |
| ib | InfiniBand |
| gpu | equipped with GPU |
| fpga | equipped with FPGA |
| nvdimm | equipped with NVDIMMs |
| ht | Hyperthreading enabled |
| noht | Hyperthreading disabled |
    
```
$ sinfo -o '%16N %f'
NODELIST         AVAIL_FEATURES
p00-arm01        aarch64,cortex-a72
p01-arm[01-08]   aarch64,a64fx,ib
p02-intel01      x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02      x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd01        x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p03-amd02        x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p04-edge01       x86_64,intel,broadwell,ib,ht
p05-synt01       x86_64,amd,milan,ib,ht
p06-arm[01-02]   aarch64,ib
p07-power01      ppc64le,ib
p08-amd01        x86_64,amd,milan-x,ib,ht
p10-intel01      x86_64,intel,sapphire_rapids,ht
```
    
```
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
```

```
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
   ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
```
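
Constraints can be combined with `&` (logical AND); for example, requesting an Ice Lake node with hyperthreading enabled (a sketch, both tags are listed for p02-intel01 above):

```
$ salloc -A PROJECT-ID -p p02-intel --constraint "icelake&ht"
```
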
    ## Resources, GRES
    
Slurm supports the ability to define and schedule arbitrary resources, called Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling/allocating GPUs and FPGAs.

!!! warning
    Use only allocated GPUs and FPGAs. Resource separation is not enforced. If you use non-allocated resources, you may observe strange behavior and run into trouble.
    
    ### Node Resources
    
Get information about GRES on a node:

```
$ scontrol -d show node p02-intel01 | grep Gres=
   Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
   Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p03-amd01 | grep Gres=
   Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
   Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u280:2
```

    ### Request Resources
    
To allocate the required resources (GPUs or FPGAs), use the `--gres` option of `salloc`, `sbatch`, or `srun`.
    
    Example: Allocate one FPGA
    
    ```
    $ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
    ```
    
    
    ### Find Out Allocated Resources
    
    
    Information about allocated resources is available in Slurm job details, attributes `JOB_GRES` and `GRES`.
    
    ```
$ scontrol -d show job $SLURM_JOBID | grep GRES=
       JOB_GRES=fpga:xilinx_alveo_u250:1
         Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
    ```
    
`IDX` in the `GRES` attribute specifies the index(es) of the FPGA(s) or GPU(s) allocated to the job on the node. In the example above, the allocated resource is `fpga:xilinx_alveo_u250:1(IDX:0)`, so we should use the FPGA with index 0 on node p03-amd01.
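
For GPUs on the p03-amd nodes, one way to honor the allocated indexes is to export them to the ROCm runtime via `ROCR_VISIBLE_DEVICES`; a hedged sketch (`./my_rocm_app` is a placeholder executable):

```
# Restrict the ROCm runtime to the GPU index reported in GRES=...(IDX:0).
$ export ROCR_VISIBLE_DEVICES=0
$ ./my_rocm_app
```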
    
    ### Request Specific Resources
    
It is possible to allocate specific resources. This is useful for partition p03-amd, which is equipped with FPGAs of different types.

The GRES entry uses the format `name[[:type]:count]`; in the following example, the name is `fpga`, the type is `xilinx_alveo_u280`, and the count is 2.
    
```
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job

$ scontrol -d show job $SLURM_JOBID | grep -i gres
   JOB_GRES=fpga:xilinx_alveo_u280:2
     Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
   TresPerNode=gres:fpga:xilinx_alveo_u280:2
```
    
    [1]: https://slurm.schedmd.com/
    
    [2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES