
Slurm Job Submission and Execution

Introduction

The Slurm workload manager is used to allocate and access resources on the Barbora cluster and Complementary systems. Support for the Karolina cluster is coming soon.

A man page exists for all Slurm commands, and the --help option provides a brief summary of options. Slurm documentation and man pages are also available online.
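
For example, to open the manual page or print a brief option summary for the sbatch command:

$ man sbatch
$ sbatch --help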

Getting Partitions Information

Display partitions/queues on system:

$ sinfo -s
PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
qcpu*           up 2-00:00:00      1/191/0/192 cn[1-192]
qcpu_biz        up 2-00:00:00      1/191/0/192 cn[1-192]
qcpu_exp        up    1:00:00      1/191/0/192 cn[1-192]
qcpu_free       up   18:00:00      1/191/0/192 cn[1-192]
qcpu_long       up 6-00:00:00      1/191/0/192 cn[1-192]
qcpu_preempt    up   12:00:00      1/191/0/192 cn[1-192]
qgpu            up 2-00:00:00          0/8/0/8 cn[193-200]
qgpu_biz        up 2-00:00:00          0/8/0/8 cn[193-200]
qgpu_exp        up    1:00:00          0/8/0/8 cn[193-200]
qgpu_free       up   18:00:00          0/8/0/8 cn[193-200]
qgpu_preempt    up   12:00:00          0/8/0/8 cn[193-200]
qfat            up 2-00:00:00          0/1/0/1 cn201
qdgx            up 2-00:00:00          0/1/0/1 cn202
qviz            up    8:00:00          0/2/0/2 vizserv[1-2]

The NODES(A/I/O/T) column summarizes node counts per state, where A/I/O/T stands for allocated/idle/other/total. The example output is from the Barbora cluster.

On the Barbora cluster, all queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job.

On Complementary systems, only some queues/partitions provide full node allocation; see the Complementary systems documentation for details.
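
To inspect per-node resources of a partition, sinfo accepts a custom output format; the format string below is only an illustrative selection (%P partition, %a availability, %l time limit, %D node count, %c CPUs per node, %m memory per node in MB):

$ sinfo -p qcpu -o "%P %a %l %D %c %m"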

Getting Job Information

Show all jobs on system:

$ squeue

Show my jobs:

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               104   qcpu    interact    user   R       1:48      2 cn[101-102]

Show job details for specific job:

$ scontrol show job JOBID

Show job details for the currently executing job from within the job session:

$ scontrol show job $SLURM_JOBID

Show my jobs using the long output format, which includes the time limit:

$ squeue --me -l

Show my jobs in running state:

$ squeue --me -t running

Show my jobs in pending state:

$ squeue --me -t pending

Show jobs for a given project:

$ squeue -A PROJECT-ID
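
squeue also supports a custom output format and can report the expected start time of pending jobs; the field selection and column widths below are only one possible layout:

$ squeue --me -o "%.10i %.10P %.20j %.8T %.10M %.10L %.6D %R"
$ squeue --me -t pending --start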

Running Interactive Jobs

Run an interactive job in the qcpu_exp queue (one node and one task by default):

$ salloc -A PROJECT-ID -p qcpu_exp

Run an interactive job on four nodes with 36 tasks per node (the recommended value for the Barbora cluster CPU partitions, based on the node core count) and a two-hour time limit:

$ salloc -A PROJECT-ID -p qcpu -N 4 --ntasks-per-node 36 -t 2:00:00

Run an interactive job with X11 forwarding:

$ salloc -A PROJECT-ID -p qcpu_exp --x11

To finish the interactive job, use either the exit command or the Ctrl+D (^D) control sequence.

!!! warning Do not use srun to initiate interactive jobs; subsequent srun and mpirun invocations would block forever.
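
Instead, request the allocation with salloc and launch parallel steps with srun inside it; a minimal sketch (two qcpu nodes, one hour):

$ salloc -A PROJECT-ID -p qcpu -N 2 --ntasks-per-node 36 -t 1:00:00
$ srun hostname | sort | uniq -c    # runs inside the allocation, one task per core
$ exit                              # release the allocation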

Running Batch Jobs

Run batch job:

$ cd my_work_dir # the submit directory my_work_dir will also be used as the working directory of the submitted job
$ sbatch script.sh

Content of the script.sh file:

#!/usr/bin/bash
#SBATCH -J MyJobName
#SBATCH -A PROJECT-ID
#SBATCH -p qcpu
#SBATCH -N 4
#SBATCH --ntasks-per-node 36
#SBATCH -t 12:00:00

ml OpenMPI/4.1.4-GCC-11.3.0

srun hostname | uniq -c

Useful command options (salloc, sbatch, srun)

  • -N, --nodes
  • --ntasks-per-node
  • -n, --ntasks
  • -c, --cpus-per-task
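
As an illustration of how these options combine, the sketch below requests a hybrid MPI+OpenMP layout on two Barbora CPU nodes: 4 tasks per node with 9 cores per task (4 x 9 = 36 cores per node). The application name my_hybrid_app is a placeholder.

#!/usr/bin/bash
#SBATCH -J HybridSketch
#SBATCH -A PROJECT-ID
#SBATCH -p qcpu
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH -c 9
#SBATCH -t 1:00:00

# one OpenMP thread per core allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# pass the per-task core count to srun explicitly
srun --cpus-per-task $SLURM_CPUS_PER_TASK ./my_hybrid_app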

Job Environment Variables

Slurm provides useful information to the job via environment variables. The variables are available on all nodes allocated to the job when accessed via Slurm-supported means (srun, compatible mpirun).

See all Slurm variables:

$ set | grep ^SLURM

Useful Variables

variable name           description                                  example
SLURM_JOBID             job id of the executing job                  593
SLURM_JOB_NODELIST      nodes allocated to the job                   cn[101-102]
SLURM_JOB_NUM_NODES     number of nodes allocated to the job         2
SLURM_STEP_NODELIST     nodes allocated to the job step              cn101
SLURM_STEP_NUM_NODES    number of nodes allocated to the job step    1
SLURM_JOB_PARTITION     name of the partition                        qcpu
SLURM_SUBMIT_DIR        submit directory                             /scratch/project/open-xx-yy/work

See the relevant Slurm srun documentation for details.
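
A minimal sketch of a batch script that reports these variables (partition and node count are example values):

#!/usr/bin/bash
#SBATCH -A PROJECT-ID
#SBATCH -p qcpu_exp
#SBATCH -N 2

echo "Job $SLURM_JOBID runs on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"
echo "Partition: $SLURM_JOB_PARTITION, submit dir: $SLURM_SUBMIT_DIR"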

Get job nodelist:

$ echo $SLURM_JOB_NODELIST
cn[101-102]

Expand the nodelist into a list of individual hostnames, one per line (without an argument, scontrol uses the contents of $SLURM_JOB_NODELIST):

$ scontrol show hostnames
cn101
cn102
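
For scripting, the expanded list can be captured, e.g. into a bash array (the variable name nodes is arbitrary):

$ nodes=($(scontrol show hostnames $SLURM_JOB_NODELIST))
$ echo "first node: ${nodes[0]}, node count: ${#nodes[@]}"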

Modifying Jobs

In general:

$ scontrol update JobId=JOBID ATTR=VALUE

Modify job's time limit:

$ scontrol update JobId=JOBID TimeLimit=4:00:00

Set/modify job's comment:

$ scontrol update JobId=JOBID Comment='The best job ever'
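
Several attributes can be changed in a single call, and the result verified with scontrol show job; for example:

$ scontrol update JobId=JOBID TimeLimit=4:00:00 Comment='The best job ever'
$ scontrol show job JOBID | grep -E 'TimeLimit|Comment'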

Deleting Jobs

Delete job by job id:

$ scancel JOBID

Delete all my jobs:

$ scancel --me

Delete all my jobs in interactive mode, confirming every action:

$ scancel --me -i

Delete all my running jobs:

$ scancel --me -t running

Delete all my pending jobs:

$ scancel --me -t pending

Delete all my pending jobs for project PROJECT-ID:

$ scancel --me -t pending -A PROJECT-ID
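
Jobs can also be deleted by job name; for example, to delete all my jobs named MyJobName (the name is just an example):

$ scancel --me -n MyJobName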