Commit 666015b2 authored by Jan Siwiec

proofread

parent 023e5e35
## Introduction
[Slurm][1] workload manager is used to allocate and access Barbora's and Complementary systems' resources.
Slurm on Karolina will be implemented later in 2023.
A `man` page exists for all Slurm commands, as well as the `--help` command option,
which provides a brief summary of options.
Slurm [documentation][c] and [man pages][d] are also available online.
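For example, to consult the manual page or the brief option summary of `sbatch` (any other Slurm command works the same way):

```console
$ man sbatch
$ sbatch --help
```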
## Getting Partition Information
```console
qdgx       up   2-00:00:00    0/1/0/1   cn202
qviz       up      8:00:00    0/2/0/2   vizserv[1-2]
```
The `NODES(A/I/O/T)` column summarizes node counts per state, where `A/I/O/T` stands for `allocated/idle/other/total`.
The example output above is from the Barbora cluster.
A graphical representation of the clusters' usage, partitions, nodes, and jobs can be found:

* for Barbora at [https://extranet.it4i.cz/rsweb/barbora][4]
* for Complementary Systems at [https://extranet.it4i.cz/rsweb/compsys][6]
On Barbora, all queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job.
On Complementary systems, only some queues/partitions provide full node allocation;
see [Complementary systems documentation][2] for details.
## Running Interactive Jobs
Sometimes you may want to run your job interactively, for example for debugging,
running your commands one by one from the command line.
Run an interactive job in the `qcpu_exp` queue, with one node and one task by default:
```console
$ salloc -A PROJECT-ID -p qcpu_exp
```
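Once the allocation is granted, an interactive shell is started and commands can be launched on the allocated node(s) with `srun`, for example:

```console
$ srun hostname
```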
Run an interactive job on four nodes, 36 tasks per node (the recommended value for the Barbora CPU partitions, based on the node core count),
with a two-hour time limit:
```console
$ salloc -A PROJECT-ID -p qcpu -N 4 --ntasks-per-node 36 -t 2:00:00
```

The script will:
* load the appropriate module
* run the command; `srun` serves as Slurm's native way of executing MPI-enabled applications, and `hostname` is used in the example just for the sake of simplicity (a minimal example script is sketched below)
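A minimal sketch of such a job script might look as follows; the partition, resource values, and module name are illustrative, not a prescribed template:

```bash
#!/usr/bin/bash
#SBATCH --job-name=MyJobName     # illustrative job name
#SBATCH --account=PROJECT-ID     # your project ID
#SBATCH --partition=qcpu         # CPU partition (Barbora)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=36     # matches the 36 cores per Barbora node
#SBATCH --time=2:00:00

ml OpenMPI                       # illustrative module; load whatever your application needs

srun hostname                    # srun starts one instance of the command per allocated task
```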
The submit directory will be used as the working directory for the submitted job,
so there is no need to change directory in the job script.
Alternatively, you can specify the job working directory using the sbatch `--chdir` (or short `-D`) option.
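For example (the directory path is illustrative):

```console
$ sbatch -D /path/to/my_work_dir script.sh
```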
### Job Submit
```console
$ cd my_work_dir
$ sbatch script.sh
```
A path to `script.sh` (relative or absolute) should be given
if the job script is in a different location than the job working directory.
By default, job output is stored in a file called `slurm-JOBID.out` and contains both job standard output and error output.
This can be changed using the sbatch options `--output` (short `-o`) and `--error` (short `-e`).
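For example (file names are illustrative; `%j` expands to the job ID):

```console
$ sbatch --output my_job.%j.out --error my_job.%j.err script.sh
```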
Example output of the job:
### Job Environment Variables
Slurm provides useful information to the job via environment variables.
Environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables
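For example, from within a job session (a simple sketch; filtering `env` output works equally well):

```console
$ set | grep ^SLURM
```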
Show my jobs:

```console
$ squeue --me
104 qcpu interact user R 1:48 2 cn[101-102]
```
Show job details for a specific job:
```console
$ scontrol show job JOBID
```

Show job details for an executing job, from within the job session:

```console
$ scontrol show job $SLURM_JOBID
```
Show my jobs using a long output format which includes time limit:
```console
$ squeue --me -l
```

Show my jobs in pending state:

```console
$ squeue --me -t pending
```
Show jobs for a given project:
```console
$ squeue -A PROJECT-ID
```

The most common job states are (in alphabetical order):
| Code | Job State | Explanation |
| :--: | :------------ | :------------------------------------------------------------------------------------------------------------- |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
| OOM | OUT_OF_MEMORY | Job experienced out of memory error. |
| PD | PENDING | Job is awaiting resource allocation. |
| PR | PREEMPTED | Job terminated due to preemption. |
| R | RUNNING | Job currently has an allocation. |
| RQ | REQUEUED | Completing job is being requeued. |
| SI | SIGNALING | Job is being signaled. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
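These two-letter codes appear in the `ST` column of `squeue` output. For jobs that have already finished, the state can also be retrieved from the accounting database, for example (a sketch, assuming job accounting via `sacct` is available):

```console
$ sacct -j JOBID --format=JobID,JobName,State,ExitCode,Elapsed
```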
### Modifying Jobs
Job attributes can be modified with `scontrol update`, for example the job comment:

```console
$ scontrol update JobId=JOBID Comment='The best job ever'
```
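Other attributes can be changed the same way, for example the time limit; note that regular users can typically only reduce, not extend, a job's time limit:

```console
$ scontrol update JobId=JOBID TimeLimit=1:00:00
```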
### Deleting Jobs
Delete a job by job ID:
```
$ scancel JOBID
```

Delete all my pending jobs:

```
$ scancel --me -t pending
```
Delete all my pending jobs for project PROJECT-ID:
```
$ scancel --me -t pending -A PROJECT-ID
```

Possible causes:
* An invalid account (i.e. project) was specified in the job submission.
* The user does not have access to the given account/project.
* The given account/project does not have access to the given partition.
* Access to the given partition was retracted due to the project's allocation exhaustion.
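If you are unsure which accounts (projects) and partitions your user may submit to, the Slurm association records can be queried, for example (a sketch, assuming `sacctmgr` is available to regular users):

```console
$ sacctmgr show associations user=$USER format=Account,Partition,QOS
```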
[1]: https://slurm.schedmd.com/
[2]: /cs/job-scheduling/#partitions