# Job Submission and Execution
    
!!! important
    Don't use the `#SBATCH --exclusive` parameter as it is already included in the SLURM configuration.

    Use the `#SBATCH --mem=` parameter **on `qfat` only**. On `cpu_` queues, whole nodes are allocated.

    Accelerated nodes (`gpu_` queues) are each divided into eight parts with corresponding memory.
    
The [Slurm][1] workload manager is used to allocate and access the resources of Karolina, Barbora, and the Complementary Systems.
    
    A `man` page exists for all Slurm commands, as well as the `--help` command option,
    which provides a brief summary of options.
    Slurm [documentation][c] and [man pages][d] are also available online.
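
For example, to view the manual page or the brief option summary for the `sbatch` command:

```console
$ man sbatch
$ sbatch --help
```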
    
    ## Getting Partition Information
    
Display partitions/queues on the system:
    
    
    ```console
    $ sinfo -s
    
    PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
    
    qcpu*           up 2-00:00:00      1/191/0/192 cn[1-192]
    qcpu_biz        up 2-00:00:00      1/191/0/192 cn[1-192]
    qcpu_exp        up    1:00:00      1/191/0/192 cn[1-192]
    qcpu_free       up   18:00:00      1/191/0/192 cn[1-192]
    qcpu_long       up 6-00:00:00      1/191/0/192 cn[1-192]
    qcpu_preempt    up   12:00:00      1/191/0/192 cn[1-192]
    
    qgpu            up 2-00:00:00          0/8/0/8 cn[193-200]
    qgpu_biz        up 2-00:00:00          0/8/0/8 cn[193-200]
    qgpu_exp        up    1:00:00          0/8/0/8 cn[193-200]
    qgpu_free       up   18:00:00          0/8/0/8 cn[193-200]
    qgpu_preempt    up   12:00:00          0/8/0/8 cn[193-200]
    qfat            up 2-00:00:00          0/1/0/1 cn201
    qdgx            up 2-00:00:00          0/1/0/1 cn202
    qviz            up    8:00:00          0/2/0/2 vizserv[1-2]
    
```

The `NODES(A/I/O/T)` column summarizes node counts per state, where `A/I/O/T` stands for `allocated/idle/other/total`.
    
The example output is from the Barbora cluster.
    
A graphical representation of the clusters' usage, partitions, nodes, and jobs can be found:

* for Karolina at [https://extranet.it4i.cz/rsweb/karolina][5]
    
    * for Barbora at [https://extranet.it4i.cz/rsweb/barbora][4]
    
    * for Complementary Systems at [https://extranet.it4i.cz/rsweb/compsys][6]
    
    
On the Karolina cluster:

* all CPU queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job;
* other queues/partitions (gpu, fat, viz) provide partial node allocation.
    
    See [Karolina Slurm Specifics][7] for details.
    
On the Barbora cluster, all queues/partitions provide full node allocation; whole nodes are allocated to the job.
    
On Complementary Systems, only some queues/partitions provide full node allocation;
see the [Complementary Systems documentation][2] for details.
    
    ## Running Interactive Jobs
    
    
Sometimes you may want to run your job interactively, for example for debugging
or for running your commands one by one from the command line.
    
Run an interactive job in the `qcpu_exp` queue, with one node and one task by default:

```console
$ salloc -A PROJECT-ID -p qcpu_exp
```
    
Run an interactive job on four nodes with 128 tasks per node (the recommended value for the Karolina CPU partitions, based on the node core count) and a two-hour time limit:

```console
$ salloc -A PROJECT-ID -p qcpu -N 4 --ntasks-per-node 128 -t 2:00:00
```
    
Run an interactive job with X11 forwarding:

```console
$ salloc -A PROJECT-ID -p qcpu_exp --x11
```
    
    To finish the interactive job, use the Ctrl+D (`^D`) control sequence.
    
!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` and `mpirun` invocations would block forever.
    
## Running Batch Jobs

Batch jobs are the standard way of running jobs and utilizing HPC clusters.
    
    
Create an example job script called `script.sh` with the following content:
    
```bash
#!/usr/bin/env bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qcpu
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 128
#SBATCH --time 12:00:00

ml OpenMPI/4.1.4-GCC-11.3.0

srun hostname | sort | uniq -c
```
    
* use the bash shell interpreter
* use `MyJobName` as the job name
* use project `PROJECT-ID` for job access and accounting
* use the partition/queue `qcpu`
* use `4` nodes
* use `128` tasks per node - the value used by MPI
* set the job time limit to `12` hours
* load the appropriate module
* run the command; `srun` is Slurm's native way of executing MPI-enabled applications, and `hostname` is used in the example just for the sake of simplicity
    
    !!! tip "Excluding Specific Nodes"
    
        Use `#SBATCH --exclude=<node_name_list>` directive to exclude specific nodes from your job, e.g.: `#SBATCH --exclude=cn001,cn002,cn003`.
    
    
The submit directory will be used as the working directory for the submitted job,
so there is no need to change directories in the job script.
    
Alternatively, you can specify the job working directory using the sbatch `--chdir` (or `-D` for short) option.
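
For example, to submit the job with an explicitly set working directory (the path below is illustrative):

```console
$ sbatch --chdir=/scratch/project/open-xx-yy/work script.sh
```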
    
    ### Srun Over mpirun
    
    While `mpirun` can be used to run parallel jobs on our Slurm-managed clusters, we recommend using `srun` for better integration with Slurm's scheduling and resource management. `srun` ensures more efficient job execution and resource control by leveraging Slurm’s features directly, and it simplifies the process by reducing the need for additional configurations often required with `mpirun`.
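
For example, within a job allocation or a batch script, an MPI application (here the placeholder `./mpi_program`) can be launched directly with `srun`:

```console
$ srun ./mpi_program
```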
    
    
    ### Job Submit
    
Submit the batch job:

```console
$ sbatch script.sh
```
    
A path to `script.sh` (relative or absolute) should be given if the job script is in a different location than the job working directory.
    
    By default, job output is stored in a file called `slurm-JOBID.out` and contains both job standard output and error output.
    
This can be changed using the sbatch options `--output` (`-o` for short) and `--error` (`-e` for short).
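
For example, to write standard output and error output to separate files (the file names are illustrative; `%j` expands to the job ID):

```console
$ sbatch -o myjob.%j.out -e myjob.%j.err script.sh
```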
    
Example output of the job:

```console
    128 cn017.karolina.it4i.cz
    128 cn018.karolina.it4i.cz
    128 cn019.karolina.it4i.cz
    128 cn020.karolina.it4i.cz
```
    
    ### Job Environment Variables
    
    Slurm provides useful information to the job via environment variables.
    
Environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
    
See all Slurm variables:

```console
$ set | grep ^SLURM
```
    
    Commonly used variables are:
    
    
| Variable name        | Description                               | Example                          |
| -------------------- | ----------------------------------------- | -------------------------------- |
| SLURM_JOB_ID         | job ID of the executing job               | 593                              |
| SLURM_JOB_NODELIST   | nodes allocated to the job                | cn[101-102]                      |
| SLURM_JOB_NUM_NODES  | number of nodes allocated to the job      | 2                                |
| SLURM_STEP_NODELIST  | nodes allocated to the job step           | cn101                            |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1                                |
| SLURM_JOB_PARTITION  | name of the partition                     | qcpu                             |
| SLURM_SUBMIT_DIR     | submit directory                          | /scratch/project/open-xx-yy/work |
    
    See relevant [Slurm documentation][3] for details.
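
For example, the variables can be used inside a job script (an illustrative snippet using variables from the table above):

```bash
echo "Running job ${SLURM_JOB_ID} on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}"
```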
    
Get the job nodelist:

```console
$ echo $SLURM_JOB_NODELIST
cn[101-102]
```

Expand the nodelist to a list of nodes:

```console
$ scontrol show hostnames
```
    
    ## Job Management
    
    ### Getting Job Information
    
    Show all jobs on system:
    
    ```console
    $ squeue
    ```
    
    Show my jobs:
    
    ```console
    $ squeue --me
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                   104   qcpu    interact    user   R       1:48      2 cn[101-102]
    ```
    
    
    Show job details for a specific job:
    
    
    ```console
    $ scontrol show job JOBID
    ```
    
    Show job details for executing job from job session:
    
    ```console
    $ scontrol show job $SLURM_JOBID
    ```
    
    
    Show my jobs using a long output format which includes time limit:
    
    
    ```console
    $ squeue --me -l
    ```
    
    Show my jobs in running state:
    
    ```console
    $ squeue --me -t running
    ```
    
    Show my jobs in pending state:
    
    ```console
    $ squeue --me -t pending
    ```
    
    
    Show jobs for a given project:
    
    
    ```console
    $ squeue -A PROJECT-ID
    ```
    
    ### Job States
    
    
    The most common job states are (in alphabetical order):
    
    
    | Code | Job State     | Explanation                                                                                                    |
    | :--: | :------------ | :------------------------------------------------------------------------------------------------------------- |
    | CA   | CANCELLED     | Job was explicitly cancelled by the user or system administrator.  The job may or may not have been initiated. |
    | CD   | COMPLETED     | Job has terminated all processes on all nodes with an exit code of zero.                                       |
    | CG   | COMPLETING    | Job is in the process of completing. Some processes on some nodes may still be active.                         |
    | F    | FAILED        | Job terminated with non-zero exit code or other failure condition.                                             |
    | NF   | NODE_FAIL     | Job terminated due to failure of one or more allocated nodes.                                                  |
    | OOM  | OUT_OF_MEMORY | Job experienced out of memory error.                                                                           |
    | PD   | PENDING       | Job is awaiting resource allocation.                                                                           |
    | PR   | PREEMPTED     | Job terminated due to preemption.                                                                              |
    | R    | RUNNING       | Job currently has an allocation.                                                                               |
    | RQ   | REQUEUED      | Completing job is being requeued.                                                                              |
    | SI   | SIGNALING     | Job is being signaled.                                                                                         |
    | TO   | TIMEOUT       | Job terminated upon reaching its time limit.                                                                   |
    
    ### Modifying Jobs
    
Modify a job attribute:

```console
$ scontrol update JobId=JOBID ATTR=VALUE
```

Modify a job's time limit:

```console
$ scontrol update JobId=JOBID timelimit=4:00:00
```

Set/modify a job's comment:

```console
$ scontrol update JobId=JOBID Comment='The best job ever'
```
    
    ### Deleting Jobs
    
    Delete a job by job ID:
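
```console
$ scancel JOBID
```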
    
    Delete all my jobs:
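
```console
$ scancel --me
```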
    
    Delete all my jobs in interactive mode, confirming every action:
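
```console
$ scancel --me -i
```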
    
    Delete all my running jobs:
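
```console
$ scancel --me -t running
```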
    
    Delete all my pending jobs:
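
```console
$ scancel --me -t pending
```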
    
Delete all my pending jobs for project `PROJECT-ID`:
    
    
```console
    $ scancel --me -t pending -A PROJECT-ID
    ```
    
    
    ## Troubleshooting
    
    ### Invalid Account
    
`sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified`

Possible causes:

    * Invalid account (i.e. project) was specified in job submission.
    * User does not have access to given account/project.
    * Given account/project does not have access to given partition.
    * Access to given partition was retracted due to the project's allocation exhaustion.
    
[1]: https://slurm.schedmd.com/
[2]: /cs/job-scheduling/#partitions
[3]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
[4]: https://extranet.it4i.cz/rsweb/barbora
[5]: https://extranet.it4i.cz/rsweb/karolina
[6]: https://extranet.it4i.cz/rsweb/compsys
[7]: /general/karolina-slurm
    
    [a]: https://slurm.schedmd.com/
    [b]: http://slurmlearning.deic.dk/
    [c]: https://slurm.schedmd.com/documentation.html
    [d]: https://slurm.schedmd.com/man_index.html
    [e]: https://slurm.schedmd.com/sinfo.html
    [f]: https://slurm.schedmd.com/squeue.html
    [g]: https://slurm.schedmd.com/scancel.html
    [h]: https://slurm.schedmd.com/scontrol.html
    [i]: https://slurm.schedmd.com/job_array.html