Commit 666015b2 authored by Jan Siwiec

proofread

parent 023e5e35
## Introduction
[Slurm][1] workload manager is used to allocate and access Barbora's and Complementary systems' resources.
Slurm on Karolina will be implemented later in 2023.
A `man` page exists for all Slurm commands, as well as the `--help` command option,
which provides a brief summary of options.
Slurm [documentation][c] and [man pages][d] are also available online.
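For example, to consult the manual page or the brief option summary of `sbatch` (any other Slurm command works the same way):

```console
$ man sbatch
$ sbatch --help
```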
## Getting Partition Information
```console
qdgx       up   2-00:00:00    0/1/0/1   cn202
qviz       up      8:00:00    0/2/0/2   vizserv[1-2]
```
The `NODES(A/I/O/T)` column summarizes node counts per state, where `A/I/O/T` stands for `allocated/idle/other/total`.
The example output above is from the Barbora cluster.
A graphical representation of the clusters' usage, partitions, nodes, and jobs can be found:

* for Barbora at [https://extranet.it4i.cz/rsweb/barbora][4]
* for Complementary Systems at [https://extranet.it4i.cz/rsweb/compsys][6]
On Barbora, all queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job.
On Complementary systems, only some queues/partitions provide full node allocation;
see [Complementary systems documentation][2] for details.
## Running Interactive Jobs
Sometimes you may want to run your job interactively, for example for debugging,
running your commands one by one from the command line.
Run an interactive job in the `qcpu_exp` queue, with one node and one task by default:
```console
$ salloc -A PROJECT-ID -p qcpu_exp
```
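Once the allocation is granted, an interactive shell is started and commands can be launched on the allocated node(s) with `srun`, for example:

```console
$ srun hostname
```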
Run an interactive job on four nodes, 36 tasks per node (the recommended value for the Barbora CPU partitions, based on the node core count),
with a two-hour time limit:
```console
$ salloc -A PROJECT-ID -p qcpu -N 4 --ntasks-per-node 36 -t 2:00:00
```

The script will:
* load the appropriate module
* run the command; `srun` serves as Slurm's native way of executing MPI-enabled applications, and `hostname` is used in the example just for the sake of simplicity (a minimal example script is sketched below)
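A minimal sketch of such a job script might look as follows; the partition, resource values, and module name are illustrative, not a prescribed template:

```bash
#!/usr/bin/bash
#SBATCH --job-name=MyJobName     # illustrative job name
#SBATCH --account=PROJECT-ID     # your project ID
#SBATCH --partition=qcpu         # CPU partition (Barbora)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=36     # matches the 36 cores per Barbora node
#SBATCH --time=2:00:00

ml OpenMPI                       # illustrative module; load whatever your application needs

srun hostname                    # srun starts one instance of the command per allocated task
```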
The submit directory will be used as the working directory for the submitted job,
so there is no need to change directory in the job script.
Alternatively, you can specify the job working directory using the sbatch `--chdir` (or short `-D`) option.
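For example (the directory path is illustrative):

```console
$ sbatch -D /path/to/my_work_dir script.sh
```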
### Job Submit
```console
$ cd my_work_dir
$ sbatch script.sh
```
A path to `script.sh` (relative or absolute) should be given
if the job script is in a different location than the job working directory.
By default, job output is stored in a file called `slurm-JOBID.out` and contains both job standard output and error output.
This can be changed using the sbatch options `--output` (short `-o`) and `--error` (short `-e`).
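For example (file names are illustrative; `%j` expands to the job ID):

```console
$ sbatch --output my_job.%j.out --error my_job.%j.err script.sh
```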
Example output of the job:
### Job Environment Variables
Slurm provides useful information to the job via environment variables.
Environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables
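For example, from within a job session (a simple sketch; filtering `env` output works equally well):

```console
$ set | grep ^SLURM
```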
Show my jobs:

```console
$ squeue --me
104 qcpu interact user R 1:48 2 cn[101-102]
```
Show job details for a specific job:
```console
$ scontrol show job JOBID
```

Show job details for an executing job, from within the job session:

```console
$ scontrol show job $SLURM_JOBID
```
Show my jobs using a long output format which includes time limit:
```console
$ squeue --me -l
```

Show my jobs in pending state:

```console
$ squeue --me -t pending
```
Show jobs for a given project:
```console
$ squeue -A PROJECT-ID
```

The most common job states are (in alphabetical order):
| Code | Job State | Explanation |
| :--: | :------------ | :------------------------------------------------------------------------------------------------------------- |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
| OOM | OUT_OF_MEMORY | Job experienced out of memory error. |
| PD | PENDING | Job is awaiting resource allocation. |
| PR | PREEMPTED | Job terminated due to preemption. |
| R | RUNNING | Job currently has an allocation. |
| RQ | REQUEUED | Completing job is being requeued. |
| SI | SIGNALING | Job is being signaled. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
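These two-letter codes appear in the `ST` column of `squeue` output. For jobs that have already finished, the state can also be retrieved from the accounting database, for example (a sketch, assuming job accounting via `sacct` is available):

```console
$ sacct -j JOBID --format=JobID,JobName,State,ExitCode,Elapsed
```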
### Modifying Jobs
Job attributes can be modified with `scontrol update`, for example the job comment:

```console
$ scontrol update JobId=JOBID Comment='The best job ever'
```
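Other attributes can be changed the same way, for example the time limit; note that regular users can typically only reduce, not extend, a job's time limit:

```console
$ scontrol update JobId=JOBID TimeLimit=1:00:00
```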
### Deleting Jobs
Delete a job by job ID:
```
$ scancel JOBID
```

Delete all my pending jobs:

```
$ scancel --me -t pending
```
Delete all my pending jobs for project PROJECT-ID:
```
$ scancel --me -t pending -A PROJECT-ID
```

Possible causes:
* An invalid account (i.e. project) was specified in the job submission.
* The user does not have access to the given account/project.
* The given account/project does not have access to the given partition.
* Access to the given partition was retracted due to the project's allocation exhaustion.
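If you are unsure which accounts (projects) and partitions your user may submit to, the Slurm association records can be queried, for example (a sketch, assuming `sacctmgr` is available to regular users):

```console
$ sacctmgr show associations user=$USER format=Account,Partition,QOS
```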
[1]: https://slurm.schedmd.com/
[2]: /cs/job-scheduling/#partitions