Skip to content
Snippets Groups Projects
Commit 18d3d49b authored by Roman Sliva's avatar Roman Sliva
Browse files

Update slurm-job-submission-and-execution.md

parent f09f7a5d
No related branches found
No related tags found
No related merge requests found
Pipeline #33208 passed with warnings
...@@ -300,9 +300,9 @@ $ sbatch --parsable --dependency=afterok:${second} job3.sh ...@@ -300,9 +300,9 @@ $ sbatch --parsable --dependency=afterok:${second} job3.sh
1581 1581
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1579 p01-arm job1 opr0019 PD 0:00 8 (Priority) 1579 p01-arm job1 user017 PD 0:00 8 (Priority)
1580 p01-arm job2 opr0019 PD 0:00 8 (Dependency) 1580 p01-arm job2 user017 PD 0:00 8 (Dependency)
1581 p01-arm job3 opr0019 PD 0:00 8 (Dependency) 1581 p01-arm job3 user017 PD 0:00 8 (Dependency)
$ scontrol show job 1580 | grep JobState $ scontrol show job 1580 | grep JobState
JobState=PENDING Reason=Dependency Dependency=afterok:1579(unfulfilled) JobState=PENDING Reason=Dependency Dependency=afterok:1579(unfulfilled)
$ scontrol show job 1581 | grep JobState $ scontrol show job 1581 | grep JobState
...@@ -329,12 +329,12 @@ $ for id in $(seq 2 6) ...@@ -329,12 +329,12 @@ $ for id in $(seq 2 6)
> done > done
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1582 p01-arm jobstart opr0019 PD 0:00 8 (Priority) 1582 p01-arm jobstart user017 PD 0:00 8 (Priority)
1583 p01-arm restart opr0019 PD 0:00 8 (Dependency) 1583 p01-arm restart user017 PD 0:00 8 (Dependency)
1584 p01-arm restart opr0019 PD 0:00 8 (Dependency) 1584 p01-arm restart user017 PD 0:00 8 (Dependency)
1585 p01-arm restart opr0019 PD 0:00 8 (Dependency) 1585 p01-arm restart user017 PD 0:00 8 (Dependency)
1586 p01-arm restart opr0019 PD 0:00 8 (Dependency) 1586 p01-arm restart user017 PD 0:00 8 (Dependency)
1587 p01-arm restart opr0019 PD 0:00 8 (Dependency) 1587 p01-arm restart user017 PD 0:00 8 (Dependency)
$ for id in $(seq 1582 1587) $ for id in $(seq 1582 1587)
> do > do
> scontrol show job $id | grep JobState > scontrol show job $id | grep JobState
...@@ -371,13 +371,26 @@ To view information about available job partitions, use the `sinfo` command: ...@@ -371,13 +371,26 @@ To view information about available job partitions, use the `sinfo` command:
```console ```console
$ sinfo $ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
p00-arm up 1-00:00:00 1 idle p00-arm01 qcpu* up 2-00:00:00 191 idle cn[1-67,69-192]
p01-arm* up 1-00:00:00 8 idle p01-arm[01-08] qcpu_biz up 2-00:00:00 1 alloc cn68
p02-intel up 1-00:00:00 2 idle p02-intel[01-02] qcpu_biz up 2-00:00:00 191 idle cn[1-67,69-192]
p03-amd up 1-00:00:00 2 idle p03-amd[01-02] qcpu_exp up 1:00:00 1 alloc cn68
p04-edge up 1-00:00:00 1 idle p04-edge01 qcpu_exp up 1:00:00 191 idle cn[1-67,69-192]
p05-synt up 1-00:00:00 1 idle p05-synt01 qcpu_free up 18:00:00 1 alloc cn68
qcpu_free up 18:00:00 191 idle cn[1-67,69-192]
qcpu_long up 6-00:00:00 1 alloc cn68
qcpu_long up 6-00:00:00 191 idle cn[1-67,69-192]
qcpu_preempt up 12:00:00 1 alloc cn68
qcpu_preempt up 12:00:00 191 idle cn[1-67,69-192]
qgpu up 2-00:00:00 8 idle cn[193-200]
qgpu_biz up 2-00:00:00 8 idle cn[193-200]
qgpu_exp up 1:00:00 8 idle cn[193-200]
qgpu_free up 18:00:00 8 idle cn[193-200]
qgpu_preempt up 12:00:00 8 idle cn[193-200]
qfat up 2-00:00:00 1 idle cn201
qdgx up 2-00:00:00 1 idle cn202
qviz up 8:00:00 2 idle vizserv[1-2]
``` ```
Here we can see output of the `sinfo` command ran on Barbora cluster. By default, it shows basic node and partition configurations. Here we can see output of the `sinfo` command ran on Barbora cluster. By default, it shows basic node and partition configurations.
...@@ -387,12 +400,12 @@ To view partition summary information, use `sinfo -s`, or `sinfo --summarize`: ...@@ -387,12 +400,12 @@ To view partition summary information, use `sinfo -s`, or `sinfo --summarize`:
```console ```console
$ sinfo -s $ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
qcpu* up 2-00:00:00 0/192/0/192 cn[1-192] qcpu* up 2-00:00:00 1/191/0/192 cn[1-192]
qcpu_biz up 2-00:00:00 0/192/0/192 cn[1-192] qcpu_biz up 2-00:00:00 1/191/0/192 cn[1-192]
qcpu_exp up 1:00:00 0/192/0/192 cn[1-192] qcpu_exp up 1:00:00 1/191/0/192 cn[1-192]
qcpu_free up 18:00:00 0/192/0/192 cn[1-192] qcpu_free up 18:00:00 1/191/0/192 cn[1-192]
qcpu_long up 6-00:00:00 0/192/0/192 cn[1-192] qcpu_long up 6-00:00:00 1/191/0/192 cn[1-192]
qcpu_preempt up 12:00:00 0/192/0/192 cn[1-192] qcpu_preempt up 12:00:00 1/191/0/192 cn[1-192]
qgpu up 2-00:00:00 0/8/0/8 cn[193-200] qgpu up 2-00:00:00 0/8/0/8 cn[193-200]
qgpu_biz up 2-00:00:00 0/8/0/8 cn[193-200] qgpu_biz up 2-00:00:00 0/8/0/8 cn[193-200]
qgpu_exp up 1:00:00 0/8/0/8 cn[193-200] qgpu_exp up 1:00:00 0/8/0/8 cn[193-200]
...@@ -437,9 +450,9 @@ To view information about queued jobs, use the `squeue` command: ...@@ -437,9 +450,9 @@ To view information about queued jobs, use the `squeue` command:
```console ```console
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1556 p01-arm interact opr0019 R 1:07 8 p01-arm[01-08] 1556 p01-arm interact user017 R 1:07 8 p01-arm[01-08]
1558 p02-intel interact easybuil CD 0:21 2 p02-intel[01-02] 1558 p02-intel interact user018 CD 0:21 2 p02-intel[01-02]
1557 p03-amd interact opr0019 R 0:57 1 p03-amd01 1557 p03-amd interact user017 R 0:57 1 p03-amd01
``` ```
By default, this shows the job ID, partition, name of the job, job owner's username, job state, how long has the job been already running, number of allocated nodes, and a list of allocated nodes. By default, this shows the job ID, partition, name of the job, job owner's username, job state, how long has the job been already running, number of allocated nodes, and a list of allocated nodes.
...@@ -449,13 +462,13 @@ To view jobs only belonging to a particular user, you can either use `--user=<us ...@@ -449,13 +462,13 @@ To view jobs only belonging to a particular user, you can either use `--user=<us
```console ```console
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1559 p01-arm interact opr0019 R 3:37 8 p01-arm[01-08] 1559 p01-arm interact user017 R 3:37 8 p01-arm[01-08]
1560 p02-intel interact easybuil R 0:05 2 p02-intel[01-02] 1560 p02-intel interact user018 R 0:05 2 p02-intel[01-02]
1557 p03-amd interact opr0019 R 10:22 1 p03-amd01 1557 p03-amd interact user017 R 10:22 1 p03-amd01
$ squeue --me $ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1559 p01-arm interact opr0019 R 4:04 8 p01-arm[01-08] 1559 p01-arm interact user017 R 4:04 8 p01-arm[01-08]
1557 p03-amd interact opr0019 R 10:49 1 p03-amd01 1557 p03-amd interact user017 R 10:49 1 p03-amd01
``` ```
`squeue` also allows for printing information about specific jobs using the `--jobs` flag: `squeue` also allows for printing information about specific jobs using the `--jobs` flag:
...@@ -463,8 +476,8 @@ $ squeue --me ...@@ -463,8 +476,8 @@ $ squeue --me
```console ```console
$ squeue --jobs 1557,1560 $ squeue --jobs 1557,1560
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1560 p02-intel interact easybuil R 2:09 2 p02-intel[01-02] 1560 p02-intel interact user018 R 2:09 2 p02-intel[01-02]
1557 p03-amd interact opr0019 R 12:26 1 p03-amd01 1557 p03-amd interact user017 R 12:26 1 p03-amd01
``` ```
For more information about the `squeue` command, its flags, and formatting options, see the manual, either by using the `man squeue` command or [online][f]. For more information about the `squeue` command, its flags, and formatting options, see the manual, either by using the `man squeue` command or [online][f].
...@@ -505,15 +518,15 @@ For more information about the `squeue` command, its flags, and formatting optio ...@@ -505,15 +518,15 @@ For more information about the `squeue` command, its flags, and formatting optio
```console ```console
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1563 p01-arm interact opr0019 PD 0:00 3 (Resources) 1563 p01-arm interact user017 PD 0:00 3 (Resources)
1562 p01-arm interact opr0019 R 0:20 8 p01-arm[01-08] 1562 p01-arm interact user017 R 0:20 8 p01-arm[01-08]
1561 p03-amd interact opr0019 R 0:25 1 p03-amd01 1561 p03-amd interact user017 R 0:25 1 p03-amd01
$ scancel 1562 $ scancel 1562
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1562 p01-arm interact opr0019 CA 0:57 8 p01-arm[01-08] 1562 p01-arm interact user017 CA 0:57 8 p01-arm[01-08]
1563 p01-arm interact opr0019 R 0:03 3 p01-arm[01-03] 1563 p01-arm interact user017 R 0:03 3 p01-arm[01-03]
1561 p03-amd interact opr0019 R 1:06 1 p03-amd01 1561 p03-amd interact user017 R 1:06 1 p03-amd01
``` ```
To cancel multiple jobs, simply list all of their job IDs: To cancel multiple jobs, simply list all of their job IDs:
...@@ -521,19 +534,19 @@ To cancel multiple jobs, simply list all of their job IDs: ...@@ -521,19 +534,19 @@ To cancel multiple jobs, simply list all of their job IDs:
```console ```console
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1562 p01-arm interact opr0019 CA 0:57 8 p01-arm[01-08] 1562 p01-arm interact user017 CA 0:57 8 p01-arm[01-08]
1564 p01-arm interact opr0019 PD 0:00 8 (Resources) 1564 p01-arm interact user017 PD 0:00 8 (Resources)
1563 p01-arm interact opr0019 R 3:31 3 p01-arm[01-03] 1563 p01-arm interact user017 R 3:31 3 p01-arm[01-03]
1561 p03-amd interact opr0019 R 4:34 1 p03-amd01 1561 p03-amd interact user017 R 4:34 1 p03-amd01
1565 p03-amd interact opr0019 R 0:07 1 p03-amd01 1565 p03-amd interact user017 R 0:07 1 p03-amd01
$ scancel 1562 1563 1561 1565 1564 $ scancel 1562 1563 1561 1565 1564
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1562 p01-arm interact opr0019 CA 0:57 8 p01-arm[01-08] 1562 p01-arm interact user017 CA 0:57 8 p01-arm[01-08]
1563 p01-arm interact opr0019 CA 4:01 3 p01-arm[01-03] 1563 p01-arm interact user017 CA 4:01 3 p01-arm[01-03]
1564 p01-arm interact opr0019 CA 0:00 8 1564 p01-arm interact user017 CA 0:00 8
1561 p03-amd interact opr0019 CA 5:04 1 p03-amd01 1561 p03-amd interact user017 CA 5:04 1 p03-amd01
1565 p03-amd interact opr0019 CA 0:37 1 p03-amd01 1565 p03-amd interact user017 CA 0:37 1 p03-amd01
``` ```
Slurm also allows for canceling only jobs which fulfill certain criteria, for example, the partition to which they have been submitted, or their job state (or a mixture of both): Slurm also allows for canceling only jobs which fulfill certain criteria, for example, the partition to which they have been submitted, or their job state (or a mixture of both):
...@@ -541,30 +554,30 @@ Slurm also allows for canceling only jobs which fulfill certain criteria, for ex ...@@ -541,30 +554,30 @@ Slurm also allows for canceling only jobs which fulfill certain criteria, for ex
```console ```console
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1569 p01-arm interact opr0019 PD 0:00 3 (Resources) 1569 p01-arm interact user017 PD 0:00 3 (Resources)
1571 p01-arm interact opr0019 PD 0:00 3 (Priority) 1571 p01-arm interact user017 PD 0:00 3 (Priority)
1568 p01-arm interact opr0019 R 1:52 8 p01-arm[01-08] 1568 p01-arm interact user017 R 1:52 8 p01-arm[01-08]
1566 p03-amd interact opr0019 R 1:55 1 p03-amd01 1566 p03-amd interact user017 R 1:55 1 p03-amd01
1567 p03-amd interact opr0019 R 1:53 1 p03-amd01 1567 p03-amd interact user017 R 1:53 1 p03-amd01
1570 p03-amd interact opr0019 R 0:26 1 p03-amd01 1570 p03-amd interact user017 R 0:26 1 p03-amd01
$ scancel --partition p03-amd $ scancel --partition p03-amd
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1569 p01-arm interact opr0019 PD 0:00 3 (Resources) 1569 p01-arm interact user017 PD 0:00 3 (Resources)
1571 p01-arm interact opr0019 PD 0:00 3 (Priority) 1571 p01-arm interact user017 PD 0:00 3 (Priority)
1568 p01-arm interact opr0019 R 3:08 8 p01-arm[01-08] 1568 p01-arm interact user017 R 3:08 8 p01-arm[01-08]
1566 p03-amd interact opr0019 CA 2:42 1 p03-amd01 1566 p03-amd interact user017 CA 2:42 1 p03-amd01
1567 p03-amd interact opr0019 CA 2:40 1 p03-amd01 1567 p03-amd interact user017 CA 2:40 1 p03-amd01
1570 p03-amd interact opr0019 CA 1:13 1 p03-amd01 1570 p03-amd interact user017 CA 1:13 1 p03-amd01
$ scancel --partition p01-arm --state R $ scancel --partition p01-arm --state R
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1568 p01-arm interact opr0019 CA 4:29 8 p01-arm[01-08] 1568 p01-arm interact user017 CA 4:29 8 p01-arm[01-08]
1569 p01-arm interact opr0019 R 0:07 3 p01-arm[01-03] 1569 p01-arm interact user017 R 0:07 3 p01-arm[01-03]
1571 p01-arm interact opr0019 R 0:07 3 p01-arm[04-06] 1571 p01-arm interact user017 R 0:07 3 p01-arm[04-06]
1566 p03-amd interact opr0019 CA 2:42 1 p03-amd01 1566 p03-amd interact user017 CA 2:42 1 p03-amd01
1567 p03-amd interact opr0019 CA 2:40 1 p03-amd01 1567 p03-amd interact user017 CA 2:40 1 p03-amd01
1570 p03-amd interact opr0019 CA 1:13 1 p03-amd01 1570 p03-amd interact user017 CA 1:13 1 p03-amd01
``` ```
For more information about the `scancel` command, its flags, and formatting options, see the manual, either by using the `man scancel` command or [online][g]. For more information about the `scancel` command, its flags, and formatting options, see the manual, either by using the `man scancel` command or [online][g].
...@@ -580,8 +593,8 @@ To view detailed job information, you can use `scontrol show job <job_id>`, for ...@@ -580,8 +593,8 @@ To view detailed job information, you can use `scontrol show job <job_id>`, for
```console ```console
$ scontrol show job 1571 $ scontrol show job 1571
UserId=opr0019(5856) GroupId=opr0019(6432) MCS_label=N/A UserId=user017(5856) GroupId=user017(6432) MCS_label=N/A
Priority=4294901692 Nice=0 Account=easybuild QOS=normal Priority=4294901692 Nice=0 Account=user018 QOS=normal
JobState=RUNNING Reason=None Dependency=(null) JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0 DerivedExitCode=0:0
...@@ -603,7 +616,7 @@ $ scontrol show job 1571 ...@@ -603,7 +616,7 @@ $ scontrol show job 1571
Features=(null) DelayBoot=00:00:00 Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null) OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=(null) Command=(null)
WorkDir=/home/opr0019 WorkDir=/home/user017
Power= Power=
``` ```
...@@ -615,24 +628,24 @@ Sometimes you may want to temporarily prevent your job from running. For this re ...@@ -615,24 +628,24 @@ Sometimes you may want to temporarily prevent your job from running. For this re
```console ```console
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1572 p01-arm interact opr0019 PD 0:00 8 (Resources) 1572 p01-arm interact user017 PD 0:00 8 (Resources)
1576 p01-arm interact opr0019 PD 0:00 8 (Priority) 1576 p01-arm interact user017 PD 0:00 8 (Priority)
1569 p01-arm interact opr0019 R 1:14:11 3 p01-arm[01-03] 1569 p01-arm interact user017 R 1:14:11 3 p01-arm[01-03]
1571 p01-arm interact opr0019 R 1:14:11 3 p01-arm[04-06] 1571 p01-arm interact user017 R 1:14:11 3 p01-arm[04-06]
$ scontrol hold 1572 1576 $ scontrol hold 1572 1576
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1576 p01-arm interact opr0019 PD 0:00 8 (JobHeldUser) 1576 p01-arm interact user017 PD 0:00 8 (JobHeldUser)
1572 p01-arm interact opr0019 PD 0:00 8 (JobHeldUser) 1572 p01-arm interact user017 PD 0:00 8 (JobHeldUser)
1569 p01-arm interact opr0019 R 1:14:32 3 p01-arm[01-03] 1569 p01-arm interact user017 R 1:14:32 3 p01-arm[01-03]
1571 p01-arm interact opr0019 R 1:14:32 3 p01-arm[04-06] 1571 p01-arm interact user017 R 1:14:32 3 p01-arm[04-06]
$ scontrol release 1572 1576 $ scontrol release 1572 1576
$ squeue $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1572 p01-arm interact opr0019 PD 0:00 8 (Resources) 1572 p01-arm interact user017 PD 0:00 8 (Resources)
1576 p01-arm interact opr0019 PD 0:00 8 (Priority) 1576 p01-arm interact user017 PD 0:00 8 (Priority)
1569 p01-arm interact opr0019 R 1:14:39 3 p01-arm[01-03] 1569 p01-arm interact user017 R 1:14:39 3 p01-arm[01-03]
1571 p01-arm interact opr0019 R 1:14:39 3 p01-arm[04-06] 1571 p01-arm interact user017 R 1:14:39 3 p01-arm[04-06]
``` ```
`scontrol` also offers an interactive mode, where you can run commands in quick succession: `scontrol` also offers an interactive mode, where you can run commands in quick succession:
...@@ -683,9 +696,9 @@ which would result in 3 output files, by default called `slurm-${SLURM_ARRAY_JOB ...@@ -683,9 +696,9 @@ which would result in 3 output files, by default called `slurm-${SLURM_ARRAY_JOB
```console ```console
$ ls -l slurm*.out $ ls -l slurm*.out
-rw-rw-r-- 1 opr0019 opr0019 159 Feb 17 14:47 slurm-1632_1.out -rw-rw-r-- 1 user017 user017 159 Feb 17 14:47 slurm-1632_1.out
-rw-rw-r-- 1 opr0019 opr0019 159 Feb 17 14:47 slurm-1632_2.out -rw-rw-r-- 1 user017 user017 159 Feb 17 14:47 slurm-1632_2.out
-rw-rw-r-- 1 opr0019 opr0019 159 Feb 17 14:47 slurm-1632_3.out -rw-rw-r-- 1 user017 user017 159 Feb 17 14:47 slurm-1632_3.out
``` ```
with the following contents: with the following contents:
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment