# Slurm Job Submission and Execution

The [Slurm][1] workload manager is used to allocate and access the resources of the Barbora cluster and the Complementary systems.

## Getting Partitions Information

Display partitions/queues
    
```console
$ sinfo -s
PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
qcpu*           up 2-00:00:00      1/191/0/192 cn[1-192]
qcpu_biz        up 2-00:00:00      1/191/0/192 cn[1-192]
qcpu_exp        up    1:00:00      1/191/0/192 cn[1-192]
qcpu_free       up   18:00:00      1/191/0/192 cn[1-192]
qcpu_long       up 6-00:00:00      1/191/0/192 cn[1-192]
qcpu_preempt    up   12:00:00      1/191/0/192 cn[1-192]
qgpu            up 2-00:00:00          0/8/0/8 cn[193-200]
qgpu_biz        up 2-00:00:00          0/8/0/8 cn[193-200]
qgpu_exp        up    1:00:00          0/8/0/8 cn[193-200]
qgpu_free       up   18:00:00          0/8/0/8 cn[193-200]
qgpu_preempt    up   12:00:00          0/8/0/8 cn[193-200]
qfat            up 2-00:00:00          0/1/0/1 cn201
qdgx            up 2-00:00:00          0/1/0/1 cn202
qviz            up    8:00:00          0/2/0/2 vizserv[1-2]
```
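
The `NODES(A/I/O/T)` column of the summary reports node counts per state as allocated/idle/other/total. As a minimal sketch of how that column could be read in a script, the helper below extracts the idle count for one partition. The name `idle_nodes` is hypothetical, and the sketch assumes the default `sinfo -s` column layout shown above.

```shell
# Print the idle-node count of one partition from `sinfo -s` output on stdin.
# Assumes the default summary layout, where the fourth column is
# NODES(A/I/O/T). idle_nodes is a hypothetical helper name.
idle_nodes() {
  awk -v part="$1" '
    NR > 1 {
      name = $1
      sub(/\*$/, "", name)        # the default partition carries a trailing "*"
      if (name == part) {
        split($4, counts, "/")    # counts[1..4] = allocated, idle, other, total
        print counts[2]
      }
    }'
}
```

It could be used as `sinfo -s | idle_nodes qcpu`.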
    
## Getting Job Information

Show your own jobs

```console
$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               104   qcpu    interact    user   R       1:48      2 cn[101-102]
```

Show job details for a specific job

```console
$ scontrol -d show job JOBID
```

Show job details for the executing job from within the job session

```console
$ scontrol -d show job $SLURM_JOBID
```
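
`scontrol show job` prints its details as whitespace-separated `Key=Value` pairs, so a single field can be picked out with a small filter. The following is a rough sketch only: `job_field` is a hypothetical helper name, and values that contain spaces (a comment, for instance) are cut at the first space.

```shell
# Print the value of one Key=Value field from `scontrol show job` output
# read on stdin. job_field is a hypothetical helper name.
# Simplification: a value containing spaces is cut at the first space.
job_field() {
  awk -v key="$1" '
    {
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")                     # position of the first "="
        if (eq > 0 && substr($i, 1, eq - 1) == key) {
          print substr($i, eq + 1)
          exit
        }
      }
    }'
}
```

Inside a job session it might be called as `scontrol -d show job "$SLURM_JOBID" | job_field JobState`.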
    
## Running Interactive Jobs

Run an interactive job

```console
$ salloc -A PROJECT-ID -p qcpu
```

Run an interactive job with X11 forwarding

```console
$ salloc -A PROJECT-ID -p qcpu --x11
```

!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` and `mpirun` invocations would block forever.
    
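## Running Batch Jobs

Example batch job script

```shell
#!/usr/bin/bash
# Job name
#SBATCH -J MyJobName
# Project (account) to charge
#SBATCH -A OPEN-00-00
# Number of nodes
#SBATCH -N 4
# Tasks per node
#SBATCH --ntasks-per-node 36
# Partition
#SBATCH -p qcpu
# Time limit (HH:MM:SS)
#SBATCH -t 12:00:00

# Load the OpenMPI module
ml OpenMPI/4.1.4-GCC-11.3.0
```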
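
A script like the one above is handed to the scheduler with `sbatch`. As a hedged sketch of capturing the resulting job ID for later `squeue` or `scontrol` calls, the wrapper below relies on `sbatch --parsable`, which prints the job ID, optionally followed by `;` and the cluster name. The function name `submit_job` and the script path in the usage line are hypothetical.

```shell
# Submit a batch script and print only the numeric job ID.
# submit_job is a hypothetical helper name; all arguments are passed
# through to sbatch unchanged.
submit_job() {
  submit_out=$(sbatch --parsable "$@") || return 1
  # --parsable prints "jobid" or "jobid;cluster"; keep the part before ";".
  printf '%s\n' "${submit_out%%;*}"
}
```

For example, `jobid=$(submit_job -A PROJECT-ID -p qcpu ./script.sh)` would store the ID in `jobid`.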
    
Useful command options (`salloc`, `sbatch`, `srun`)

* `-N`, `--nodes`
* `--ntasks-per-node`
* `-n`, `--ntasks`
* `-c`, `--cpus-per-task`
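
These options combine multiplicatively. The sketch below works through the arithmetic for the values used in the batch script (`-N 4`, `--ntasks-per-node 36`), assuming tasks are spread evenly across nodes and `--cpus-per-task` is left at Slurm's default of 1.

```shell
# Values mirror the batch script directives: -N 4, --ntasks-per-node 36.
nodes=4
ntasks_per_node=36
cpus_per_task=1    # Slurm's default when -c is not given

# Total tasks across the allocation, and the CPU cores they occupy.
total_tasks=$(( nodes * ntasks_per_node ))
total_cpus=$(( total_tasks * cpus_per_task ))

echo "tasks=${total_tasks} cpus=${total_cpus}"
```

With these values the job runs 144 tasks in total, one core each.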
    
## Slurm Job Environment Variables

Slurm provides useful information to the job via environment variables. These variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, a compatible `mpirun`).

See all Slurm variables
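
One way to list them, sketched here under the assumption of a POSIX shell running inside a job session, is to filter the environment for names beginning with `SLURM_`. The helper name `slurm_vars` is hypothetical. Outside a job the function simply prints nothing.

```shell
# List all environment variables whose names start with SLURM_, sorted.
# slurm_vars is a hypothetical helper name.
slurm_vars() {
  env | grep '^SLURM_' | sort
}
```

Running `slurm_vars` in a job session would print lines such as `SLURM_JOBID=593`.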
    
| Variable name | Description | Example |
| ------ | ------ | ------ |
| `SLURM_JOBID` | job ID of the executing job | 593 |
| `SLURM_JOB_NODELIST` | nodes allocated to the job | cn[101-102] |
| `SLURM_JOB_NUM_NODES` | number of nodes allocated to the job | 2 |
| `SLURM_STEP_NODELIST` | nodes allocated to the job step | cn101 |
| `SLURM_STEP_NUM_NODES` | number of nodes allocated to the job step | 1 |
| `SLURM_JOB_PARTITION` | name of the partition | qcpu |
| `SLURM_SUBMIT_DIR` | submit directory | /scratch/project/open-xx-yy/work |

See the [Slurm srun documentation][2] for details.
    
Get the job nodelist

```console
$ echo $SLURM_JOB_NODELIST
cn[101-102]
```

Expand the nodelist to a list of nodes

```console
$ scontrol show hostnames $SLURM_JOB_NODELIST
cn101
cn102
```
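
`scontrol show hostnames` is the reliable way to do this expansion. Purely to illustrate what it does, here is a simplified sketch that handles a single prefix with one bracket group, such as `cn[101-102]` or `vizserv[1-2]`. The function name `expand_nodelist` is hypothetical, and real Slurm nodelists can be more complex than this handles.

```shell
# Expand a simple nodelist such as "cn[101-102,105]" into one name per line.
# expand_nodelist is a hypothetical helper name. Simplified sketch: supports
# one prefix followed by a single bracket group that ends the string.
expand_nodelist() {
  printf '%s\n' "$1" | awk '
    {
      lb = index($0, "[")
      if (lb == 0) { print; next }              # plain hostname, nothing to expand
      prefix = substr($0, 1, lb - 1)
      body = substr($0, lb + 1, length($0) - lb - 1)
      n = split(body, parts, ",")
      for (k = 1; k <= n; k++) {
        if (split(parts[k], range, "-") == 2) {
          # Pad to the width of the range start so 001-003 stays zero padded.
          fmt = "%s%0" length(range[1]) "d\n"
          for (i = range[1] + 0; i <= range[2] + 0; i++)
            printf fmt, prefix, i
        } else {
          print prefix parts[k]
        }
      }
    }'
}
```

Calling `expand_nodelist 'cn[101-102]'` prints `cn101` and `cn102` on separate lines.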
    
## Modifying Jobs

Modify a job attribute

```console
$ scontrol update JobId=JOBID ATTR=VALUE
```

For example, set a job comment

```console
$ scontrol update JobId=JOBID Comment='The best job ever'
```
    
[1]: https://slurm.schedmd.com/
[2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES